Abstract. We study the problem of eliminating recursion from monadic
datalog programs on trees with an infinite set of labels. We show that
the boundedness problem, i.e., determining whether a datalog program is
equivalent to some nonrecursive one, is undecidable, but decidability
is regained if the descendant relation is disallowed. Under similar
restrictions we obtain decidability of the problem of equivalence to a given
nonrecursive program. We investigate the connection between these two
problems in more detail.

1 Introduction 

Among logics with fixpoint capabilities, one of the most prominent is datalog,
which augments unions of conjunctive queries (positive existential first order
formulae) with recursion. Datalog originated as a declarative programming
language, but later found many applications in databases as a query language. The
gain in expressive power does not, however, come for free. Compared to unions
of conjunctive queries, evaluating a datalog program is harder [22] and basic
properties such as containment or equivalence become undecidable [?].

Since the source of the difficulty in dealing with datalog programs is their 
recursive nature, the first line of attack in trying to optimize such programs is to 
eliminate the recursion. It is well-known that a nonrecursive datalog program can 
be rewritten as a union of conjunctive queries. The main focus of this paper is 
therefore the equivalence of recursive datalog programs to unions of conjunctive 
queries. 

Example 1. The programs in this example work on databases that use binary
predicates likes and knows, and a unary predicate trendy. First, consider the
following pair of datalog programs:

V_1:
    buys(X, Y) ← likes(X, Y)
    buys(X, Y) ← trendy(X), buys(Z, Y)

V_1':
    buys(X, Y) ← likes(X, Y)
    buys(X, Y) ← trendy(X), likes(Z, Y)

The program V_1 is recursive because its second rule refers to the predicate buys.
It can be shown that V_1 is equivalent to the nonrecursive program V_1'. Consider,
on the other hand, the following pair of programs:

V_2:
    buys(X, Y) ← likes(X, Y)
    buys(X, Y) ← knows(X, Z), buys(Z, Y)

V_2':
    buys(X, Y) ← likes(X, Y)
    buys(X, Y) ← knows(X, Z), likes(Z, Y)

It can be shown that V_2 is not equivalent to the nonrecursive program V_2'.
Moreover, V_2 is not equivalent to any nonrecursive program.

The example above (taken from [?]) presents two approaches to eliminating
recursion from datalog programs. Either we want to determine for a given
datalog program if it is equivalent to some nonrecursive datalog program or
decide whether a given datalog program is equivalent to a given nonrecursive
program. These problems bear some similarities but in general they are separate.
The latter is decidable [?], while the former, called the boundedness problem,
is not [?].

Negative results for full datalog fueled interest in its restrictions [?].
Important restrictions include monadic programs, using only unary predicates
in the heads of rules; linear programs, with at most one use of an intensional
predicate per rule; and connected programs, where within each rule all variables
that are mentioned are connected to each other. Throughout this paper only
monadic datalog programs are considered. In [12], Cosmadakis et al. show that
for such programs the boundedness problem becomes decidable. Moreover, they
use the same techniques to prove that the containment problem of two monadic
datalog programs is decidable. These results suggest that under some additional
assumptions the boundedness problem and the equivalence problem are more
closely related.

In this paper we study connected, monadic datalog programs restricted to
tree-structured databases. Our models are finite trees whose nodes carry labels
from an infinite alphabet that can be tested for equality. Over such structures
the problem of equivalence to a given union of conjunctive queries is known
to be undecidable [?]. We show that the boundedness problem is also
undecidable. In some cases, however, we regain decidability of both problems in the
absence of the descendant relation. On ranked trees we show that the equivalence
and the boundedness problems become decidable (in 2-ExpTime). On unranked
trees we prove that the equivalence of a linear program to a nonrecursive one is
ExpSpace-complete. We finish with an analysis of the connection between the
equivalence and the boundedness problems and show that under some
assumptions they are equi-decidable.

Organization. In Section 2 we introduce datalog programs and some basic
definitions. In Section 3 we deal with the problem of equivalence to a given
nonrecursive datalog program. In Section 4 we analyze the boundedness problem.
Finally, in Section 5 we explore the connection between the two approaches to
eliminating recursion from datalog programs and show that under some
assumptions the arising decision problems are equi-decidable. We conclude in Section 6
with possible directions for future research. Due to the page limit most of the
proofs are moved to the appendix.







2 Preliminaries 


In this paper we work over finite trees labeled with letters from an infinite
alphabet Σ. The trees are unranked by default, but we also work with ranked
trees, in particular with words. We use the standard notation for axes: ↓ and ↓+ stand,
respectively, for the child and descendant relations. We assume that each node has
one label. A binary relation ~ holds between nodes with identical labels and
there is a unary predicate a for each a ∈ Σ, holding for the nodes labeled with
a.

We begin with a brief description of the syntax and semantics of datalog; for
more details see [2] or [?]. A datalog program V over a relational signature
S is a finite set of rules of the form head ← body, where head is an atom
over S and body is a (possibly empty) conjunction of atoms over S written as a
comma-separated list. All variables in the body that are not used in the head are
implicitly quantified existentially. The size of a rule is the number of different
variables that appear in it.

The relational symbols, or predicates, in S fall into two categories.
Extensional predicates are the ones explicitly stored in the database; they are never
used in the heads of rules. In our setting they come from {↓, ↓+, ~} ∪ Σ. The
alphabet Σ is infinite, but the program V uses only a finite subset of it, which we
denote by Σ_V. Intensional predicates, used both in the heads and bodies, are
defined by the rules.

The program is evaluated by generating all atoms (over intensional
predicates) that can be inferred from the underlying structure (tree) by applying
the rules repeatedly, to the point of saturation. Each inferred atom can be
witnessed by a proof tree: an atom inferred by a rule r from intensional atoms
A_1, A_2, ..., A_n is witnessed by a proof tree with the root labeled by r, and n
children which are the roots of the proof trees for the atoms A_i (if r has no intensional
predicates in its body then the root has no children).

There is a designated predicate called the goal of the program. We will often
identify the goal predicate with the program, i.e., we write V(X) if the goal
predicate of the program V holds on the node X. When evaluated in a given
database D, the program V results in the unary relation
V(D) = {X ∈ D | V(X) holds}. If V(D) ⊆ Q(D) for every database D then we say that
the program V is contained in the program Q. If the containment holds both
ways then the programs V and Q are equivalent.

Example 2. The program below computes the nodes from which one can reach
some label a along a path where each node has a child with identical label and
a descendant with label b (or has label b itself).

    P(X) ← X↓Y, P(Y), X↓Y', X ~ Y', Q(X)    (p1)
    P(X) ← a(X)                              (p2)
    Q(X) ← X↓Y, Q(Y)                         (q1)
    Q(X) ← b(X)                              (q2)

[Figure: the program (left), a proof tree built from the rules p1, p2, q1, q2 (center), and a tree with node labels a, b, c (right).]

The intensional predicates are P and Q, and P is the goal. The proof tree shown
in the center witnesses that P holds in the root of the tree on the right.
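
The saturation procedure described above can be illustrated on the program of Example 2. The following is a minimal sketch, not the construction used in the paper: the tree encoding, the node numbers and the function names are assumptions made for illustration, and the sample tree is merely one tree on which P holds at the root.

```python
from collections import defaultdict

# Hypothetical encoding of a small unranked tree: node -> (label, parent).
# Node 0 is the root. The tree is an illustrative assumption.
TREE = {
    0: ("c", None),
    1: ("c", 0),
    2: ("b", 0),
    3: ("b", 2),
    4: ("a", 2),
}

def children_of(tree):
    """Map every node to the list of its children."""
    kids = defaultdict(list)
    for node, (_, parent) in tree.items():
        if parent is not None:
            kids[parent].append(node)
    return kids

def evaluate(tree):
    """Apply the rules p1, p2, q1, q2 repeatedly until nothing new is inferred."""
    kids = children_of(tree)
    label = {node: lab for node, (lab, _) in tree.items()}
    P, Q = set(), set()
    changed = True
    while changed:
        changed = False
        for x in tree:
            # (q2) Q(X) <- b(X)          (q1) Q(X) <- X child Y, Q(Y)
            if x not in Q and (label[x] == "b" or any(y in Q for y in kids[x])):
                Q.add(x)
                changed = True
            # (p2) P(X) <- a(X)
            # (p1) P(X) <- X child Y, P(Y), X child Y', X ~ Y', Q(X)
            if x not in P and (
                label[x] == "a"
                or (
                    any(y in P for y in kids[x])
                    and any(label[y] == label[x] for y in kids[x])
                    and x in Q
                )
            ):
                P.add(x)
                changed = True
    return P, Q

P, Q = evaluate(TREE)
print(sorted(P), sorted(Q))
```

The loop stops once a full pass over the nodes infers no new atom, which is the point of saturation; the set P returned at the end plays the role of V(D) for the goal predicate P.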

The notion of proof trees comes from papers on datalog over general
structures (see e.g. [?]). As shown in Example 2, proof trees illustrate how the
program evaluates. While on general structures for a given proof tree one can always
find a model such that the proof tree witnesses a correct evaluation of the
program, on tree structures this is not so simple. One reason is that we allow only
one label for every node. As a result, rules like P(X) ← a(X), b(X) cannot be
satisfied for a ≠ b. Moreover, nodes have a unique parent. Because of this it is
not easy to determine whether a given proof tree is a witness of an evaluation
of the program on some model, and it does not suffice to eliminate unsatisfiable
rules. Proof trees for which such a model exists will be called satisfiable proof
trees.

Example 3. The program below goes down a tree along a path labeled with a.
Then it goes up the tree until it finds a node labeled with b.

    P(X) ← X↓Y, a(Y), P(Y)    (p3)
    P(X) ← Q(X)                (p4)
    Q(X) ← Y↓X, Q(Y)           (q3)
    Q(X) ← b(X)                (q4)

[Figure: two single-branch proof trees built from the rules p3, p4, q3, q4.]

The first proof tree is satisfiable, but the second proof tree is not satisfiable
because it enforces both labels a and b on the same node.

In this paper we consider only monadic programs, i.e., programs whose
intensional predicates are at most unary. Moreover, throughout the paper we
assume that the programs do not use 0-ary intensional predicates. For general
programs this is merely for the sake of simplicity: one can always turn a 0-ary
predicate Q into a unary predicate Q(X) by introducing a dummy variable X. For
connected programs (described below) this restriction matters.

For a datalog rule r, let G_r be a graph whose vertices are the variables
used in r and an edge is placed between X and Y if the body of r contains
an atomic formula X↓Y or X↓+Y. In G_r we distinguish a head node and
intensional nodes. The latter are all variables from the body of r used by
intensional predicates. A program V is connected if for each rule r ∈ V, the
graph G_r is connected.¹
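
The connectedness condition on a single rule can be checked by a standard graph search. The sketch below assumes a hypothetical encoding in which a rule is given by its set of variables and by one pair (X, Y) for each atom X↓Y or X↓+Y in its body; the function name is also an assumption.

```python
from collections import defaultdict, deque

def is_connected_rule(variables, axis_atoms):
    """Check whether the graph G_r of a rule is connected.

    variables:  names of all variables used in the rule
    axis_atoms: pairs (X, Y), one per child or descendant atom in the body
    """
    variables = set(variables)
    if not variables:
        return True
    # Edges of G_r are undirected.
    adjacent = defaultdict(set)
    for x, y in axis_atoms:
        adjacent[x].add(y)
        adjacent[y].add(x)
    # Breadth-first search from an arbitrary variable.
    start = next(iter(variables))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adjacent[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return variables <= seen

# Rule p1 of Example 2: P(X) <- X child Y, P(Y), X child Y', X ~ Y', Q(X)
print(is_connected_rule({"X", "Y", "Y'"}, [("X", "Y"), ("X", "Y'")]))
# A rule such as P(X) <- a(X), b(Y) has no axis atoms linking X and Y.
print(is_connected_rule({"X", "Y"}, []))
```

Note that the label-equality atom X ~ Y' contributes no edge, in line with the definition above, which only uses the child and descendant atoms.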

Previous work on datalog on arbitrary structures often considered the case of
connected programs [?]. The practical reason is that real-life programs tend
to be connected (cf. [3]). Also, rules which are not connected combine pieces of
unrelated data, corresponding to the cross product, an unnatural operation in
the database context. It seems even more natural to assume connectedness when
working with tree-structured databases. We shall do so. We write Datalog(↓, ↓+)
for the class of connected monadic datalog programs, and Datalog(↓) for
connected monadic programs that do not use the relation ↓+.

1 One could consider a definition allowing additionally nodes connected by the equality
relation, but we expect that this would be as hard as the disconnected case; e.g., the
main problem we leave open in Section 3, the equivalence of child-only non-linear
programs, becomes undecidable by the results of [?] for boolean queries.

A datalog program is linear if the right-hand side of each rule contains at
most one atom with an intensional predicate (proof trees for such programs are
single branches). For linear programs we shall use the letter L, e.g., L-Datalog(↓)
means linear programs from Datalog(↓). The program from Example 2 is
connected, but not linear. The program from Example 3 is both connected and
linear.

Conjunctive queries (CQs) are existential first order formulae of the form
∃x_1 ... x_k φ, where φ is a conjunction of atoms. We will consider unions of
conjunctive queries (UCQs), corresponding to nonrecursive programs with
a single intensional predicate (goal) which is never used in the bodies of rules.
Since UCQs can be seen as datalog programs, we can speak of connected UCQs
and, as for datalog, we shall always assume connectedness. We denote the classes
of connected queries by CQ(↓, ↓+), CQ(↓), UCQ(↓, ↓+), UCQ(↓), respectively.

3 Equivalence 

For datalog programs the containment problem can be reduced to the equivalence
problem. Let V be a datalog program and let Q be a UCQ. Then V ⊆ Q iff
V ∨ Q ≡ Q. Notice that this reduction does not depend on the type of the
programs (e.g., disallowing the ↓+ relation, or assuming linearity) but relies on the
fact that datalog programs are closed under disjunction.
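
The reduction can be spelled out step by step. The derivation below is a sketch that assumes the natural semantics of disjunction, namely that (V ∨ Q)(D) = V(D) ∪ Q(D) for every database D.

```latex
\begin{align*}
V \subseteq Q
  &\iff \forall D:\ V(D) \subseteq Q(D) \\
  &\iff \forall D:\ V(D) \cup Q(D) = Q(D) \\
  &\iff \forall D:\ (V \lor Q)(D) = Q(D) \\
  &\iff V \lor Q \equiv Q .
\end{align*}
```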

The containment problem for datalog programs has been studied on trees
in other contexts [?]. In [17] containment of datalog programs in UCQs
on data trees was analyzed in detail for boolean queries, which are queries that
return the answer 'yes' if they are satisfied in some node of a given database,
and the answer 'no' otherwise. More formally, a datalog program V defines a
boolean query V_bool(D) which equals 1 iff V(D) is nonempty and 0 otherwise.

The containment problem is usually solved by considering the dual problem.
For unary queries, it is the question whether there exist a database D and X ∈ D
such that V(X) and ¬Q(X), where ¬Q = D \ Q(D). For boolean queries, it is the
question whether there exist a database D and X, Y ∈ D such that V(X) and ¬Q(Y).
For datalog programs over trees, if we allow the ↓+ relation this distinction does
not make much of a difference (intuitively because using ↓+ one can move from
a node X to any node Y). Thus a closer look at the proofs of Theorem 1 and
Proposition 3 from [17] gives the following.

Proposition 4. Over ranked and unranked trees the containment problem of
L-Datalog(↓, ↓+) programs in UCQ(↓, ↓+) is undecidable.

In the rest of this section we work only with fragments of datalog without the
↓+ relation. We start with ranked trees.




Theorem 5. The containment problem is 2-ExpTime-complete for Datalog(↓)
over ranked trees. In the special case of words it is PSpace-complete.

The above result yields tight complexity bounds for the equivalence
problem of Datalog(↓) programs to UCQ(↓) programs over ranked trees. To prove
Theorem 5 (see Appendices B.1 and B.2) we define automata that simulate the
behavior of datalog programs, modifying the approach of [17]. The new
construction gives better complexity results for non-linear programs.²

In the rest of this section we focus on the equivalence problem of Datalog(↓)
programs to UCQ(↓) programs over unranked trees. For the containment
problem, this question was left open in [1].

For boolean queries, the containment problem of Datalog(↓) programs in
UCQ(↓) programs was proved undecidable in [17]. Decidability was restored for
the linear fragment, for which it was shown to be 2-ExpTime-complete. We
improve the complexity for unary queries using different techniques (see
Appendices B.4 and [?]).

Theorem 6. The containment problem of an L-Datalog(↓) program in a UCQ(↓)
program is ExpSpace-complete over unranked trees.

Unfortunately our approach does not generalize to the non-linear case. On the
other hand, the proof of undecidability provided in [17] also cannot be adapted
to work in our setting.³ We leave the question of the decidability of containment
for non-linear programs as an open problem.

The following lemma is proved in Appendix B (we do not assume linearity).

Lemma 7. The containment problem of UCQ(↓) queries in Datalog(↓) is in
NPTime over ranked and unranked trees.

As a corollary of Theorem 6 and Lemma 7 we obtain the main result of this
section. The lower bound is carried over from the containment problem.

Theorem 8. The equivalence problem of an L-Datalog(↓) program to a UCQ(↓)
program is ExpSpace-complete over unranked trees.


2 In [17] the non-linear case required an additional exponential blow-up. However, the
improvement of complexity is not caused by considering unary instead of boolean
queries. It is easy to see that Theorem 5 also holds in the boolean case.

3 Indeed, the main idea of the undecidability proof is to use the UCQ Q to find errors
in the run of a Turing machine encoded by the program V. If the nonrecursive query
Q is unary it can only find errors close to the node X such that V(X).


4 Boundedness

Consider a datalog program V with a goal predicate P. By V^i(D) we denote the
collection of facts about the predicate P that can be deduced from a database
D by at most i applications of the rules in V. More formally, V^i(D) is the subset
of V(D) derived using proof trees of height at most i, where the height of a tree
is the length of the longest path from its root to a leaf. Then obviously


    V(D) = ⋃_{i ≥ 0} V^i(D).

We say that the program V is bounded if there exists a number n, depending
only on V, such that for any database D, we have V(D) = V^n(D). Intuitively
this means that the depth of recursion is independent of the input database.⁴

Each proof tree corresponds to a conjunctive query in a natural way.
Therefore, we can always translate a datalog program to an equivalent, but possibly
infinite, union of conjunctive queries. If the program is bounded then it is
equivalent to a finite subunion of its corresponding conjunctive queries. For full datalog
it is known that the opposite implication is also true, i.e., a program is bounded
iff it is equivalent to a (finite) UCQ [?]. The same holds for the class Datalog(↓):

Proposition 9. Let V ∈ Datalog(↓). Then V is bounded iff it is equivalent to a
union of conjunctive queries Q ∈ UCQ(↓).

We remark that the above characterization (which we prove in Appendix C)
is based on the existence of so-called canonical databases for CQs (see e.g. [10])
in Datalog(↓). The following example shows that without canonical databases
equivalence to some UCQ does not necessarily imply boundedness. It relies on
the fact that ↓+ is the transitive closure of ↓.

Example 10. The program V ∈ Datalog(↓, ↓+) below is not bounded: finding
b in a tree can take arbitrarily long. The program V' that follows it is a UCQ
equivalent to V.

V:
    P(X) ← X↓+Y, a(Y)
    P(X) ← X↓Y, Q(Y)
    Q(X) ← X↓Y, Q(Y)
    Q(X) ← b(X)

V':
    P(X) ← X↓+Y, a(Y)
    P(X) ← X↓+Y, b(Y)
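
The unboundedness of the recursive program in Example 10 can be observed experimentally. The following sketch evaluates that program stage by stage on a path whose only node labeled b is the last one, and reports the first stage i at which P holds at the root. The path encoding and the function name are assumptions for illustration; the rule using label a is omitted because no node of these paths carries that label.

```python
def first_stage_at_root(depth_of_b):
    """First i such that the root is in V^i(D), or None if it never is.

    D is the path 0 -> 1 -> ... -> depth_of_b, where only the last node
    has label b and all other nodes have a neutral label c.
    """
    last = depth_of_b
    label = {i: ("b" if i == last else "c") for i in range(last + 1)}
    P, Q = set(), set()
    stage = 0
    while True:
        # One simultaneous application of all rules to the facts of the
        # previous stage; stage i covers proof trees of height at most i.
        new_Q = {x for x in label if label[x] == "b" or (x + 1) in Q}
        new_P = {x for x in label if (x + 1) in Q}
        if 0 in new_P:
            return stage
        if new_P == P and new_Q == Q:
            return None  # saturation reached without deriving P at the root
        P, Q = new_P, new_Q
        stage += 1

for d in (1, 2, 5, 10):
    print(d, first_stage_at_root(d))
```

On these inputs the stage grows with the depth of the b-labeled node, so no single n satisfies V(D) = V^n(D) for all databases D, whereas the nonrecursive program with the descendant axis finds the same node with one rule application.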


We obtain a negative result for L-Datalog(↓, ↓+) (see Appendix C.1).

Theorem 11. The boundedness problem for L-Datalog(↓, ↓+) is undecidable over
words and ranked or unranked trees.

In the following we work with fragments of datalog without the ↓+ relation. For
decidability results we use the automata-theoretic approach of [12].

Theorem 12. The boundedness problem for Datalog(↓) over words is in PSpace.

4 Observe that we are only interested in the output on the goal predicate. This is why
the property we consider is sometimes called predicate boundedness [?].



In the case of trees the same technique can be applied but the complexity
increases (see Appendix C.2).

Theorem 13. The boundedness problem for Datalog(↓) over ranked trees is in
2-ExpTime.

Over words, the relations ↓ and ↓+ are interpreted as the “next position” and
the “following position” relations. Let X be a position in a word w. The n-neighbourhood
of X in w is the infix of w which begins at position max(1, X - n) and ends at
position min(|w|, X + n). The following lemma is motivated by Proposition 3.2
of [12]. Its proof is provided in Appendix C.2.

Lemma 14. Let V be a Datalog(↓) program. Then V is bounded iff there exists
n > 0 such that for every word w and position X, if X ∈ V(w) then X ∈ V(v),
where v is the n-neighbourhood of X in w.
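
The n-neighbourhood used in Lemma 14 is a simple slice of the word. A minimal sketch, assuming words are Python strings (or lists of labels) and positions are numbered from 1 as in the text; the function name is hypothetical.

```python
def neighbourhood(w, x, n):
    """Return the n-neighbourhood of position x (1-based) in the word w."""
    if not 1 <= x <= len(w):
        raise ValueError("position out of range")
    start = max(1, x - n)        # first position of the infix
    end = min(len(w), x + n)     # last position of the infix
    return w[start - 1:end]      # convert to 0-based, end-exclusive slicing

print(neighbourhood("abcdefg", 4, 2))
print(neighbourhood("abcdefg", 1, 2))
```

When x is closer than n to either end of w, the infix is shorter than 2n + 1 positions, which is the boundary case mentioned at the end of the proof of Theorem 12.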

Proof (of Theorem 12). A word w such that for some position X in w we have
X ∈ V(w) but X ∉ V(v), where v is the n-neighbourhood of X in w, will be
called an n-witness. By Lemma 14 a Datalog(↓) program V is unbounded iff
there exist n-witnesses for arbitrarily big n > 0.

Consider a Datalog(↓) program V. Let Σ_0 be an alphabet that contains the
set of labels used explicitly in the rules of V together with N “fresh” labels,
where N is the size of the biggest rule in V. It is known [17] (and easy to verify)
that any word w can be relabeled so that the obtained word w' uses only labels
from Σ_0, and for each position X we have that X ∈ V(w) iff X ∈ V(w'). This is
also true with respect to infixes, i.e., for every infix v of w, and every position X
it holds that X ∈ V(v) iff X ∈ V(v'), where v' is the corresponding infix of w'.
Hence, we can verify the existence of n-witnesses over the finite alphabet Σ_0.

In the proof of Theorem 5 (see Appendix B.1) a nondeterministic automaton
is introduced that recognizes words over the alphabet Σ_0 satisfying V. More
precisely, the constructed automaton A_V works over the alphabet Σ_0 × {0,1},
and accepts a word w iff it has exactly one position X marked with 1 such that
X ∈ V(w). We denote the language recognized by A_V by L(A_V). The size of
this automaton is exponential in the size of V.

Similarly, we obtain an automaton A_¬V recognizing those words over the
alphabet Σ_0 × {0,1} which have exactly one position marked with 1 but do not
belong to L(A_V). The size of A_¬V is also exponential in the size of V (there is
no exponential blow-up because the constructions in Appendix B.1 go through
alternating automata) and the language it recognizes will be denoted L(A_¬V).
Note that this language is closed under infixes containing the marked position.

We define a nondeterministic automaton B_V which accepts exactly those
words belonging to L(A_V) which have an infix that belongs to L(A_¬V). The states
and transitions of B_V are the states and transitions of the product automaton
A_V × A_¬V together with the states and transitions of two copies of the automaton
A_V, which we call the first copy and the second copy. Let q_init be the initial
state of A_¬V. For each state q of the first copy we add to B_V an epsilon transition
from the state q to the state (q, q_init) of the product automaton. Now, let F be
the set of final states of A_¬V. For each state q of the second copy and each
q_fin ∈ F we add to B_V an epsilon transition from the
state (q, q_fin) to q. The initial state of B_V is the initial state of the first copy and
the final states of B_V are the final states of the second copy. Hence, an accepting
run of the automaton B_V starts in the first copy, moves to the product automaton
at some point, reads an infix that belongs to L(A_¬V) and finally goes to the
second copy to accept.

Let N be the number of states of the product automaton A_V × A_¬V plus 1.
Suppose that B_V accepts an N-witness w. Then, due to the pumping lemma,
it accepts n-witnesses for arbitrarily big n > 0. To end the proof we show that
checking whether B_V accepts some N-witness is in NLogSpace in the size of
the automata A_V and A_¬V (i.e., in PSpace in the size of V).

An N-witness is a word that belongs to L(A_V) but the N-neighbourhood
of the position marked with 1 belongs to L(A_¬V). The NLogSpace algorithm
simulates a run of the automaton B_V. The size of B_V is exponential in the size
of V but its states and transitions can be generated on the fly in polynomial
space. The algorithm guesses a state from the A_V × A_¬V part and checks if it
is reachable from the initial state. This is a simple reachability test which is
in NLogSpace. Then it guesses some run of the A_V × A_¬V part, counts the
number of transitions done before the one marked with 1, and ensures that it
is at least N. After the transition marked with 1 it ensures that the automaton
makes at least N more transitions before leaving the A_V × A_¬V part. For both
of these counting procedures we need log(N) tape cells. Finally, the algorithm
performs a second reachability test to check if the automaton can reach a final
state.

There are three possible shapes of an N-witness. For
simplicity, the algorithm described above does not deal with the case when the
N-neighbourhood that belongs to L(A_¬V) is shorter than 2N + 1 (which can
happen if it begins at the first position of w or ends at the last position of w).
Those possibilities can be verified similarly. □

Notice that if V is bounded then N from the proof above is the bound on the 
depth of recursion. Since the size of the constructed automaton is exponential in 
the size of the program V, the UCQ which is equivalent to this program consists 
of proof trees of size at most exponential in the size of V.

5 Boundedness vs equivalence 

In this section we focus on the similarities between the boundedness and the
equivalence problem for datalog programs. In Sections 3 and 4 those problems are
treated separately but with similar techniques. Also in [12], where boundedness
and equivalence are considered for monadic programs on arbitrary structures,
both problems are solved using the same automata-theoretic construction. For
these reasons we investigate the connection between the two problems in more
detail. In contrast to the previous sections, in this section the structures under
consideration are not necessarily trees or words.

Definition 15. A class C of datalog programs over a fixed class of databases is
called well-behaved if:

1. for every program V ∈ C all the UCQs corresponding to the proof trees for
V belong to C,

2. containment of a UCQ in a datalog program is decidable for C.

Condition (1) is satisfied by most natural classes of programs, in particular by
the class of all datalog programs on arbitrary structures and the class Datalog(↓)
on trees. For the class of datalog programs on arbitrary structures Condition
(2) is also known to hold (see [?]). Lemma 7 shows that the class
Datalog(↓) on trees satisfies Condition (2). Hence both those classes are
well-behaved.

We say that C has a computable bound if there exists a computable
function f such that if a datalog program V in C is bounded and f(V) = n then
V(D) = V^n(D) for any database D, i.e., for bounded programs the function f
returns a bound on the depth of recursion. For programs which are not bounded
f returns an arbitrary natural number.

Example 16. Consider the full datalog. It follows from the results of [12] that
the class of monadic datalog programs on arbitrary structures has a computable
bound. It is not stated explicitly but a closer analysis of the proofs gives that for
a bounded program V the depth of recursion can be bounded polynomially in the
size of the automaton constructed to check if V is bounded. For example, for a
linear connected program the size of such an automaton is bounded exponentially
in the size of the program.

The following theorem for a well-behaved class C with a computable bound 
establishes a connection between the problems of boundedness and equivalence 
to a given UCQ. 

Theorem 17. For any well-behaved class C with a computable bound the
following conditions are equivalent:

1. boundedness is decidable,

2. it is decidable whether two programs are equivalent, given that one of them
is a UCQ.

Proof. Let f be the function from the definition of the computable bound. For
the implication from (1) to (2), take programs V and Q which belong to C and
assume that Q is a UCQ. Since C is well-behaved, we only need to show how to
decide whether V is contained in Q. It follows from the assumption that we can
verify if V is bounded. If this is the case, then let f(V) = n. Observe that V is
equivalent to the UCQ V' that corresponds to the proof trees for V of height at
most n. It remains to decide whether the UCQ V' is contained in Q.

Suppose now that V is not bounded and consider a union R of the programs V
and Q. More formally, let R be a program containing the rules of both programs
V and Q. If the predicate Q occurs in the program V we rename it so that
the predicates do not coincide. The goal predicate R holds for X iff we have
V(X) or Q(X). For this we introduce two additional rules R(X) ← V(X) and
R(X) ← Q(X). The atoms Q(X) are all inferred in one step. Therefore, if R is
unbounded then there exists X satisfying V(X) such that Q(X) does not hold,
and hence V is not contained in Q. If R is bounded then using f we construct
an equivalent UCQ R' and check whether it is equivalent to Q. If this is the case
then V is contained in Q. Otherwise it is not.

For the other implication, consider a datalog program V ∈ C and let f(V) = n.
Then V is bounded iff V(D) = V^n(D) for any database D. Let Q be the UCQ
that corresponds to the proof trees of V of height at most n. It suffices to decide
whether the programs V and Q are equivalent. But this is decidable from the
assumption that C is well-behaved. □

While assuming that a class of programs is well-behaved is natural, the existence
of a computable bound is a strong assumption. It is needed since an algorithm
that solves the boundedness problem might not be constructive, meaning that
we do not know how big the equivalent UCQ is. However, deciding if such a
function exists is usually as hard as solving the boundedness problem. From
Example 16 we know that for monadic programs on arbitrary structures there exist
constructive algorithms for the boundedness problem, and hence we have a
computable bound. On the other hand, the undecidability results of the boundedness
problem for datalog on arbitrary structures rely heavily on the fact that such
a computable bound does not exist. In [?] the authors present reductions
from the halting problem for 2-counter machines and Turing machines. If a
datalog program is bounded then the size of the equivalent UCQ corresponds to the
length of an accepting run of these machines, which of course cannot be bounded
by a computable function. The results of our paper are, in this sense, similar: the
positive results provide computable bounds whereas the negative results rely on
the fact that such a function does not exist. For these reasons we conjecture that
for well-behaved classes of datalog programs the decidability of the boundedness
problem is equivalent to the decidability of finding a computable bound. If this
conjecture holds true then Theorem 17 becomes an implication from (1) to (2)
because the opposite implication is trivially satisfied.

6 Conclusions 

The equivalence to a given nonrecursive program and the boundedness problem
for Datalog(↓, ↓+) are undecidable. To regain decidability we considered programs
that do not use the ↓+ relation. We showed that equivalence to a given UCQ
over ranked trees is decidable, and over unranked trees it is decidable in the case
of linear programs. We also showed the decidability of boundedness on words
and ranked trees. In the most general case of non-linear Datalog(↓) programs
over unranked trees we do not know if the two problems under consideration are
decidable and we leave these questions as open problems.

We also investigated the connection between the boundedness and the
equivalence to a UCQ. We showed that these problems are equivalently decidable for
classes of programs with a computable bound. We suspect, however, that the
existence of a computable bound for a class of programs is equivalent to the
decidability of the boundedness problem. We also leave this as an open problem.
A Definitions


A.1 Automata

Throughout the paper all decidability results use automata constructions. We 
briefly recall the standard automata model for ranked trees here. 

A (bottom-up) tree automaton $A = (\Gamma, Q, \delta, F)$ on at most $n$-ary trees
consists of a finite alphabet $\Gamma$, a finite set of states $Q$, a set of accepting states
$F \subseteq Q$, and a transition relation $\delta \subseteq \bigcup_{i=0}^{n} Q \times \Gamma \times Q^i$.
A run on a tree $t$ over $\Gamma$ is a labeling $\rho$ of $t$ with elements of $Q$ consistent
with the transition relation, i.e., if $v$ has children $v_1, v_2, \ldots, v_k$ with $k \le n$, then
$(\rho(v), lab_t(v), \rho(v_1), \ldots, \rho(v_k)) \in \delta$. In particular, if $v$ is a leaf we have
$(\rho(v), lab_t(v)) \in \delta$. A run $\rho$ is accepting if it assigns a
state from $F$ to the root. A tree is accepted by $A$ if it admits an accepting run.
The language recognized by $A$, denoted by $L(A)$, is the set of all accepted trees.
We recall that testing emptiness of a tree automaton can be done in PTime, but
complementation involves an exponential blow-up. In the special case of
words, testing emptiness is in NLogSpace.
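The definition of a run can be illustrated with a minimal Python sketch. The tree encoding (pairs of a label and a list of children), the helper names, and the toy automaton are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import product

def reachable_states(tree, delta):
    """States that some run can assign to the root of `tree`.

    A tree is a pair (label, children); delta is a set of tuples
    (q, label, q1, ..., qk), mirroring the transition relation above.
    """
    label, children = tree
    child_sets = [reachable_states(c, delta) for c in children]
    result = set()
    # Try every combination of states reachable at the children.
    for combo in product(*child_sets):
        for t in delta:
            if t[1] == label and t[2:] == combo:
                result.add(t[0])
    return result

def accepts(tree, delta, final):
    """A tree is accepted if some run assigns a final state to the root."""
    return bool(reachable_states(tree, delta) & final)

# Hypothetical automaton over {a, b} on at most binary trees:
# state "yes" means that some node in the subtree is labelled b.
DELTA = set()
for lab in "ab":
    for k in range(3):
        for kids in product(["no", "yes"], repeat=k):
            q = "yes" if lab == "b" or "yes" in kids else "no"
            DELTA.add((q, lab) + kids)
FINAL = {"yes"}
```

For instance, `accepts(("a", [("a", []), ("b", [])]), DELTA, FINAL)` holds, while a tree consisting only of a-labelled nodes is rejected.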

As an intermediate automata model, closer to datalog than the bottom-up 
automata, we shall use the two-way alternating automata introduced in [12]. A
two-way alternating automaton $A = (\Gamma, Q, q_I, \delta)$ consists of an alphabet $\Gamma$,
a finite set of states $Q$, an initial state $q_I \in Q$, and a transition function

$$\delta : Q \times \Gamma \to BC^+(Q \times \{-1, 0, 1\})$$

describing the actions of automaton $A$ in state $q$ in a node with label $a$ as a positive
boolean combination of atomic actions of the form $(p, d) \in Q \times \{-1, 0, 1\}$.

A run $\rho$ of $A$ over a tree $t$ is a tree labelled with pairs $(q, v)$, where $q$ is a state
of $A$ and $v$ is a node of $t$, satisfying the following conditions: the root of $\rho$ is
labelled with the pair consisting of $q_I$ and the root of $t$, and if a node of $\rho$ with
label $(q, v)$ has children with labels $(q_1, v_1), \ldots, (q_n, v_n)$, and $v$ has label $a$ in $t$,
then there exist $d_1, \ldots, d_n \in \{-1, 0, 1\}$ such that:

- $v_i$ is a child of $v$ in $t$ for all $i$ such that $d_i = 1$;

- $v_i = v$ for all $i$ such that $d_i = 0$;

- $v_i$ is the parent of $v$ in $t$ for all $i$ such that $d_i = -1$; and

- the boolean combination $\delta(q, a)$ evaluates to true when the atomic actions
$(q_1, d_1), \ldots, (q_n, d_n)$ are substituted by true, and the other atomic actions are
substituted by false.

A tree $t$ is accepted by automaton $A$ if it admits a finite run. By $L(A)$ we denote
the language recognized by $A$; that is, the set of trees accepted by $A$.
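The last condition on runs amounts to evaluating a positive boolean combination under a chosen set of atomic actions. The sketch below assumes a hypothetical nested-tuple encoding of formulas (`("and", ...)`, `("or", ...)`, `("atom", state, direction)`); this encoding is an illustration, not the paper's.

```python
def holds(formula, true_atoms):
    """Evaluate a positive boolean combination of atomic actions.

    `true_atoms` is the set of pairs (state, direction) substituted by true;
    every other atomic action is substituted by false.
    """
    if formula is True or formula is False:
        return formula
    op = formula[0]
    if op == "atom":
        return (formula[1], formula[2]) in true_atoms
    if op == "and":
        return all(holds(f, true_atoms) for f in formula[1:])
    if op == "or":
        return any(holds(f, true_atoms) for f in formula[1:])
    raise ValueError(f"unknown connective: {op!r}")

# (p, 1) and ((q, 0) or (r, -1)): go down in state p, and either
# stay in state q or go up in state r.
EXAMPLE = ("and", ("atom", "p", 1),
           ("or", ("atom", "q", 0), ("atom", "r", -1)))
```

Because the combination is positive (no negation), adding more atomic actions to `true_atoms` can never turn a satisfied formula into an unsatisfied one.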

According to the definition above, two-way alternating automata only distinguish
between going up, down, and staying where they are. In a more general
model, appropriate for ordered ranked trees, one could also distinguish between
going to the first child, the second child, etc. Given that our datalog programs
are not able to make such a distinction, this simplified definition suffices.

The computation model of two-way alternating automata is very similar to
that of datalog programs, making them a perfect intermediate formalism on the
road to nondeterministic bottom-up automata. From there one continues thanks
to the following fact.

Proposition 18 ([12]). Given a two-way alternating automaton A (interpreted 
over words or ranked trees), one can compute (in time polynomial in the size of 
the input and output) single-exponential nondeterministic bottom-up automata 
recognizing the language L(A) and its complement, respectively. 

Notice that complementing two-way alternating automata is not trivial because 
there can be infinite runs that are not accepting. 

A.2 Canonical models and homomorphisms 

Let $r$ be a satisfiable rule of a datalog program $V$. Recall from Section 2 that
$G_r$ is a graph of nodes from $r$. A pattern $\pi_r$ has the same nodes and edges as
$G_r$ but the type of edge between nodes ($\downarrow$ or $\downarrow^+$) is distinguished. The nodes
are labeled with variable names. If there is an extensional unary predicate, e.g.
$a(X)$, specified by the rule then we replace the label $X$ with $a$. We simulate the
relation $\sim$ by repeating variable labels.

Since in our setting the relation $\downarrow^+$ is disallowed, we can always transform
a satisfiable rule $r$ into an equivalent rule $r'$ such that $\pi_{r'}$ is a tree. This is
because our models are trees and therefore nodes that have a common child can
be merged into one node.
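The merging step can be sketched in Python as follows. The edge-list encoding of a rule body and the function name are assumptions for illustration; the sketch only unifies variables and does not detect unsatisfiable rules.

```python
def make_tree_pattern(edges):
    """Unify variables that share a child until every node has one parent.

    `edges` is a list of (parent, child) pairs taken from the child atoms
    of a rule body. Returns (rep, merged_edges), where rep maps each
    variable to its representative after merging.
    """
    link = {}  # union-find links: merged variable -> surviving variable

    def find(x):
        while link.get(x, x) != x:
            x = link[x]
        return x

    changed = True
    while changed:
        changed = False
        seen = {}  # child representative -> parent representative
        for p, c in edges:
            p, c = find(p), find(c)
            if c in seen and seen[c] != p:
                link[p] = seen[c]  # two parents of c: merge them
                changed = True
                break
            seen[c] = p
    merged = sorted({(find(p), find(c)) for p, c in edges})
    rep = {v: find(v) for edge in edges for v in edge}
    return rep, merged
```

Applied to the child atoms of a rule with body `X↓Y, Y↓Z, T↓Z`, the sketch merges `T` into `Y` and leaves the path `X, Y, Z`.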

Example 19. The rule $P$ is transformed into its tree version $P'$. The repeated
occurrence of $X$ in the patterns corresponding to these rules represents the
relation $\sim$.

$P(X) \leftarrow X \downarrow Y,\ Y \downarrow Z,\ T \downarrow Z,\ a(T),\ X \sim Z$

$P'(X) \leftarrow X \downarrow Y,\ Y \downarrow Z,\ a(Y),\ X \sim Z$

[Figure: the patterns corresponding to $P$ and $P'$.]


A homomorphism from a pattern $\pi_r$ to a model tree $t$ is a function between
nodes that preserves the extensional predicates. A proof tree witnesses an
evaluation of the program on a given model $t$ iff for all rules there is a homomorphism
from their patterns to $t$ such that the intensional nodes are mapped to
the same nodes as the head nodes in the following rules. The connection between
patterns and datalog is explained in more detail in [21].

From a satisfiable proof tree we obtain a canonical model. First we change 
rules to patterns and merge head nodes with intensional nodes. Nodes labeled 
with variables are relabeled with fresh labels (preserving the equalities forced 
by ~). The obtained graph can be seen as a pattern of the proof tree. Then we 
turn it into a tree as in Example 19.

It is easy to see that it suffices to consider the containment problem only on
canonical models. If there is a model $t$ for $V \wedge \neg Q$ then there is a witnessing
proof tree for $V$ on $t$. The canonical model corresponding to this proof tree is
also a model for $V \wedge \neg Q$.


B Equivalence 


We decide containment by constructing an automaton that is non-empty iff there 
is a counterexample to containment. To do this, we mark a single node in a tree, 
and use the automaton to verify if the goal predicates of programs in question 
are satisfied in this node. Formally, we extend the alphabet by taking its product 
with {0,1}, and recognize models which have exactly one node marked with 1. 
To obtain tight complexity bounds, we use two-way alternating automata. The 
same technique was used in m- 

B.l Special case: words 

Over words, the relations $\downarrow$ and $\downarrow^+$ are interpreted as the “next position” and
the “following position”.

Lemma 20. Let $V \in$ Datalog($\downarrow$) and let $\Sigma_0$ be a finite alphabet. There exists
a two-way alternating automaton that accepts all words over $\Sigma_0 \times \{0, 1\}$ with
exactly one position with label $(a, 1)$ for some $a \in \Sigma_0$, such that $V$ holds in that
position. The automaton can be constructed in time polynomial in $|V|$ and $|\Sigma_0|$.

Proof. Let us fix a program $V \in$ Datalog($\downarrow$) and a finite alphabet $\Sigma_0$. The
alphabet is $\Sigma_0 \times \{0, 1\}$ but most of the time the second component is ignored.
Since we work over words (and consider only connected programs), without loss
of generality we can assume that each rule $r$ is of the form

$$H(x_0) \leftarrow \bigwedge_{i=k}^{\ell-1} x_i \downarrow x_{i+1} \;\wedge\; \varphi(x_k, x_{k+1}, \ldots, x_\ell)$$

where $k \le 0 \le \ell$ and $\varphi(x_k, x_{k+1}, \ldots, x_\ell)$ is a conjunction of atoms over unary
predicates and $\sim$; that is, it does not use $\downarrow$. This means that the pattern
corresponding to the body of $r$ is a word.

In the automaton $A_V = (\Sigma_0, Q, q_0, \delta)$ we are about to define we allow
transitions of a slightly generalized form: the transition function $\delta$ assigns to each
state-letter pair a positive boolean combination of elements of

$$Q \times \{-N, -N + 1, \ldots, N\}$$

for a fixed constant $N \in \mathbb{N}$, rather than just $Q \times \{-1, 0, 1\}$. The semantics of this
is the natural one: $(q, k)$ means that the automaton moves by $k$ positions (left
or right depending on the sign of $k$) and changes state to $q$. Each generalized
automaton can be transformed to a standard one at the cost of enlarging the
state-space by the factor of $2N + 1$. In our case $N$ will be bounded by the
maximal number of variables used in a rule of $V$.
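One way to simulate a generalized atomic action with unit moves is sketched below in Python. The auxiliary state names `("wait", q, r)` are hypothetical and chosen only for illustration; the paper does not spell out this construction.

```python
def unit_steps(q, k):
    """Simulate the generalized atomic action (q, k) using moves in {-1, 0, 1}.

    Returns (first_action, aux), where first_action is a standard atomic
    action (state, direction) and aux maps each auxiliary state
    ("wait", q, r), meaning 'r more positions to go before entering q',
    to its single action, taken regardless of the current letter.
    """
    if k == 0:
        return (q, 0), {}
    d = 1 if k > 0 else -1

    def target(r):
        # State to enter when r positions still remain after this step.
        return q if r == 0 else ("wait", q, r)

    aux = {("wait", q, r): (target(r - d), d) for r in range(d, k, d)}
    return (target(k - d), d), aux
```

The auxiliary states depend only on `q` and on the remaining offset, which ranges over at most `2N - 1` non-zero values, so this sketch stays within the `2N + 1` factor mentioned above.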

Let us describe the automaton $A_V$. The state-space $Q$ is

$$\Sigma_0 \cup V \cup \{q_0\};$$

that is, it consists of the letters from $\Sigma_0$, the rules of $V$ and an additional
initial state $q_0$. The transition relation $\delta$ is defined as follows. In the initial state,
regardless of the current letter, we loop moving to the right until we reach the
position in the word where we start evaluating $V$:

$$\delta(q_0, (\cdot, 0)) = (q_0, +1),$$

$$\delta(q_0, (\cdot, 1)) = (r_{goal}, 0),$$

where $r_{goal}$ is the goal rule of $V$. This is the only case when $A_V$ does not ignore
the component $\{0, 1\}$ in the alphabet. That is, we require that there is 1 in the
second component when the first goal rule is applied. When we are in state $r \in V$,
regardless of the current letter, we check that the body of $r$ can be matched in
the input word in such a way that $x_0$ is mapped to the current position:

$$\delta(r, \cdot) = \bigwedge_{a(x_i)} (a, i) \;\wedge\; \bigwedge_{x_i \sim x_j} \bigvee_{b \in \Sigma_0} \big((b, i) \wedge (b, j)\big) \;\wedge\; \bigwedge_{R(x_i)} \bigvee_{r' \in V_R} (r', i),$$

where $a(x_i)$, $x_i \sim x_j$, and $R(x_i)$ range respectively over label, $\sim$, and intensional
atoms of $r$, and $V_R \subseteq V$ is the set of rules defining the intensional predicate $R$. In
state $a \in \Sigma_0$ we simply check that the letter in the current position is $a$:

$$\delta(a, a) = \top, \quad \text{and} \quad \delta(a, b) = \bot \text{ for } b \ne a.$$

Checking correctness and the size bounds for $A_V$ poses no difficulties. Taking a
product of $A_V$ with an automaton (of size linear in $|\Sigma_0|$) that checks if there is
exactly one position with label $(a, 1)$ for some $a \in \Sigma_0$ gives the automaton from
the statement.

Now we can show the proof of Theorem 5 for the case of words.

Proof (of Theorem 5 (words)). In Proposition 2 of [17] it is shown that over words
it suffices to check satisfiability of $V \wedge \neg Q$ over an alphabet $\Sigma_0$ of linear size.
For programs $V$ and $Q$, let $A_V$ and $A_Q$ be alternating two-way automata given
by Lemma 20. From automata $A_V$ and $A_Q$, by Proposition 18 we obtain one-way
non-deterministic automata $B_V$ and $B_{\neg Q}$ of exponential size that recognize
respectively the language $L(A_V)$ and the complement of $L(A_Q)$. From this we
easily get a product automaton $B_{V \wedge \neg Q}$ equivalent to the query $V(x) \wedge \neg Q(x)$.
Indeed, it accepts all words over $\Sigma_0$ with exactly one position $x$ marked with 1,
such that $V(x) \wedge \neg Q(x)$.

The size of $B_{V \wedge \neg Q}$ is exponential in the size of $V$, $Q$, but its states and
transitions can be generated on the fly in polynomial space. To check emptiness of
$B_{V \wedge \neg Q}$ we make a simple reachability test, which is in NLogSpace. Altogether,
this gives a PSpace algorithm. □

B.2 Ranked trees 

The results for words can be lifted to ranked trees: complexities are higher, but 
the general picture remains the same. 


Lemma 21. Let $V \in$ Datalog($\downarrow$) be a program with rules of size at most $n$ and
let $\Sigma_0$ be a finite alphabet. There exists a two-way alternating automaton $A_V$ of
size $O(\|V\| \cdot |\Sigma_0|^n \cdot n)$ recognizing trees over $\Sigma_0 \times \{0, 1\}$ with exactly one node
with label $(a, 1)$ for some $a \in \Sigma_0$, such that $V$ holds in that node.

Proof. Let us fix a program $V \in$ Datalog($\downarrow$) and a finite alphabet $\Sigma_0$. Given that
we are only interested in trees over alphabet $\Sigma_0$, we can eliminate the use of $\sim$
from $V$: if a rule contains $x \sim y$ we replace this rule with $|\Sigma_0|$ variants in which
$x \sim y$ is replaced with $a(x) \wedge a(y)$ for $a \in \Sigma_0$. The size of the program grows by
an $O(|\Sigma_0|^n)$ factor; the size of the rules grows only by a constant factor.
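The elimination of $\sim$ can be sketched in Python. The encoding of a rule body as a list of atoms plus a separate list of $\sim$ pairs is an assumption made for illustration.

```python
from itertools import product

def eliminate_sim(body_atoms, sim_pairs, alphabet):
    """Return the variants of a rule body with every x ~ y replaced.

    Each pair (x, y) in `sim_pairs` is replaced by the label atoms
    a(x), a(y), with one variant per choice of letter a for each pair.
    Label atoms are encoded as (letter, variable).
    """
    variants = []
    for letters in product(sorted(alphabet), repeat=len(sim_pairs)):
        extra = []
        for (x, y), a in zip(sim_pairs, letters):
            extra += [(a, x), (a, y)]
        variants.append(list(body_atoms) + extra)
    return variants
```

A rule with `m` occurrences of $\sim$ yields `len(alphabet) ** m` variants, which matches the exponential factor in the size of the program stated above.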

Since we are working on trees we can further transform the program so that
the patterns corresponding to the rules of the program are trees (with in and out
nodes positioned arbitrarily). Indeed, it can be done by unifying variables $x$ and
$y$ whenever the rule contains $x \downarrow z$ and $y \downarrow z$ for some variable $z$, and removing
rules containing the atom $u \downarrow u$, or atoms $a(u)$ and $b(u)$ for some variable $u$ and
distinct letters $a$ and $b$ (see Example 19). This modification does not increase
the size of the program.

Finally, we rewrite each rule into a set of rules of the form

$$H(x_0) \leftarrow a(x_0) \;\wedge\; \bigwedge_{i=1}^{\ell} ax_i(x_0, x_i) \;\wedge\; \psi(x_0, x_1, \ldots, x_\ell)$$

where $a \in \Sigma_0$, $ax_i(x_0, x_i)$ is either $x_0 \downarrow x_i$ or $x_i \downarrow x_0$, and $\psi(x_0, x_1, \ldots, x_\ell)$ is
a conjunction of (monadic) intensional atoms. That is, one rule can only test
the label and some intensional predicates for the current node, and demand the
existence of neighbours (children or parents) satisfying some intensional predicates.
This modification introduces auxiliary intensional predicates, but the size of the
program increases only by an $O(n)$ factor.

The resulting program is essentially a two-way alternating automaton $B$, only
given in a different syntax. The automaton from the statement is obtained by
modifying the automaton $B$ similarly as in the case of words.

Proof (of Theorem 5 (trees)). In Theorem 1 of [17] it is shown that for trees it
suffices to verify containment over a finite alphabet $\Sigma_0$, although for trees $\Sigma_0$ is of
exponential size. Using Lemma 21 and Proposition 18 we reduce the containment
problem to the emptiness problem for a nondeterministic tree automaton of
double exponential size in $|V|$, and test emptiness with the standard PTime
algorithm. □

The lower bounds can be obtained by straightforward modifications of the
results in m- 

B.3 Satisfiability on unranked trees 

Proposition 22. The satisfiability problem for L-Datalog($\downarrow$) on unranked trees
is in ExpTime.


Before proving this result let us introduce the notation. 

Definition 23. Let $\Sigma_0$ be a finite alphabet. A universal $\Sigma_0$-tree is a full
$|\Sigma_0|$-ary tree over $\Sigma_0$ such that every non-leaf node has a child with each label from
$\Sigma_0$. For $a \in \Sigma_0$, $n \in \mathbb{N}$, we will denote by $U_n^a$ a universal $\Sigma_0$-tree of height $n$ and
with $a$ in the root.



[Figure: a root with three children, each of which has three children labelled a, b, c.]

Fig. 1: A universal $\Sigma_0$-tree for $\Sigma_0 = \{a, b, c\}$
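A universal tree is easy to generate. The Python sketch below assumes that a single node has height 1 and encodes trees as pairs of a label and a list of children; both conventions are assumptions for illustration and may differ from the paper's.

```python
def universal_tree(root_label, height, alphabet):
    """Build a universal tree with `root_label` in the root.

    Every non-leaf node gets exactly one child per letter of `alphabet`.
    Assumption: a tree consisting of a single node has height 1.
    """
    if height <= 1:
        return (root_label, [])
    children = [universal_tree(b, height - 1, alphabet)
                for b in sorted(alphabet)]
    return (root_label, children)

def count_leaves(tree):
    """Number of leaves of a tree encoded as (label, children)."""
    _, children = tree
    return 1 if not children else sum(count_leaves(c) for c in children)
```

Under this height convention, `universal_tree("a", 3, {"a", "b", "c"})` has nine leaves, labelled a, b, c under each of the three children of the root.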


The proof will proceed as follows. First, we will show that if $V$ is satisfiable,
then it is satisfiable in a universal $\Sigma_0$-tree. Then it is easy to see (combining
Lemma 21 and Proposition 18) that the set of universal $\Sigma_0$-trees satisfying $V$ is
regular and recognized by an automaton with a number of states double exponential
in $|V|$. For linear programs, however, we can do better and get an ExpTime
algorithm.

Lemma 24. Let $V \in$ Datalog($\downarrow$) and let $\Sigma_0$ be a finite set of labels such that $\Sigma_V \subseteq \Sigma_0$.
The program $V$ is satisfiable iff $V$ is satisfiable in a universal $\Sigma_0$-tree.

Proof. It suffices to show that if $V$ is satisfiable, then it is satisfied in some
universal $\Sigma_0$-tree. The other direction is obvious. Let $t$ be a model for $V$. Recall
that $\Sigma_V$ is the set of constants used in $V$. First, we can change all labels from $t$
that are not in $\Sigma_0$ to a single label chosen from $\Sigma_0$ (preserving the equalities).
Since Datalog($\downarrow$) programs do not use negation, this operation can only
make the set $V(t)$ bigger. Next, we perform the following operation. If a node
$v$ has two or more children with the same label then we merge these children
into one node. The resulting node inherits the children of all the merged nodes.
It is easy to check that this operation preserves homomorphisms and does not
change the emptiness of the set $V(t)$. We apply this procedure until there are no
siblings with the same label. Finally we add nodes to the obtained tree so that
it becomes a universal $\Sigma_0$-tree. Of course adding nodes cannot decrease the set
$V(t)$, which finishes the proof. □
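The sibling-merging step of this proof can be sketched in Python. The tree encoding (pairs of a label and a list of children) and the function name are assumptions for illustration.

```python
def merge_siblings(tree):
    """Merge siblings that carry the same label, recursively.

    A merged node receives the children of all the nodes it replaces,
    so after the call no node has two children with the same label.
    """
    label, children = tree
    groups = {}  # child label -> children of all siblings with that label
    for child_label, grandchildren in children:
        groups.setdefault(child_label, []).extend(grandchildren)
    merged = [merge_siblings((lab, kids)) for lab, kids in sorted(groups.items())]
    return (label, merged)
```

For example, a root with two b-labelled children, one having a c-child and the other having a c-child and a d-child, becomes a root with a single b-child whose children are labelled c and d.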

From now on we assume that $V$ is a linear program. We will actually prove
a stronger result that will be useful for deciding the containment of a datalog
program in a UCQ. We will show an algorithm for calculating all possible ways
of evaluating the program $V$ in the universal $\Sigma_0$-tree such that the evaluation
uses the root of this tree.

First, we need to introduce the notion of a partial matching of a datalog program.
We say that a rule is matched to a tree $t$ if there is a homomorphism from
its pattern into $t$. Let $r_1 r_2 \ldots r_n$ be a proof word. A partial matching $m$ of a
program $V$ into a tree $t$ is an infix $r_i \ldots r_j$ of a proof word such that all the rules
$r_{i+1}, \ldots, r_{j-1}$ are matched completely and $r_i$ and $r_j$ are partially matched, such
that the images of the intensional nodes are equal to the following head nodes.

Each partial matching $m$ can be represented by a pair of partial homomorphisms
from the patterns of the first and the last rule of the infix of the proof
word. We are interested in the partial matchings that map one of the nodes of 
the pattern to the root of the tree. Thus each partial homomorphism can be 
represented as a partial function from pattern 7r into Of course there are 
also partial matchings with nodes mapped below the root of the tree, and one
end of a partial matching may be impossible to extend. This situation can arise
when the goal rule is at the beginning of the matching, or when a non-recursive rule
is in the last position (a leaf). We use an additional symbol OK to mark this
situation.

We denote the set of all partial matchings of $V$ by $Match(V)$. The size of
$Match(V)$ is exponential in the size of $V$. Obviously it suffices to calculate the
set of all partial matchings into a tree to determine if it satisfies $V$.

Lemma 25. Let $\Sigma_0$ be a finite alphabet. The set of partial matchings of $V$
matched in the root of any universal $\Sigma_0$-tree can be calculated in time exponential
in $|V|$ for linear programs.

Proof. For a tree $t$ we will denote the set of partial matchings in the root of $t$ by
$matched(t)$. Observe that every partial matching of $V$ in $U_{n-1}^a$ is also a
partial matching of $V$ in $U_n^a$, so $matched$ is monotonic, i.e.,
$matched(U_{n-1}^a) \subseteq matched(U_n^a)$. This observation yields a simple algorithm.
There are $|\Sigma_0|$ different universal trees of height $n$. To calculate $matched(U_n^a)$ for each $a \in \Sigma_0$
it suffices to join partial matchings from $\bigcup_{b \in \Sigma_0} matched(U_{n-1}^b)$ using the root
node labeled with $a$ and add the previously calculated $matched(U_{n-1}^a)$. Note
that if $matched(U_n^a) = matched(U_{n+1}^a)$, then $matched(U_n^a) = matched(U_m^a)$ for
all $m > n$. Therefore, the described procedure requires at most $|Match(V)|$ steps
to terminate, and each step takes $O(|Match(V)|)$ time, which gives an ExpTime
algorithm. □
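The iteration in this proof is a standard monotone fixpoint computation. The Python sketch below abstracts the actual joining of partial matchings into a hypothetical callback `join(a, below)`; its interface and the toy instance are assumptions for illustration, not the paper's construction.

```python
def saturate(alphabet, join):
    """Compute matched(U_n^a) for every letter a until nothing changes.

    `join(a, below)` is assumed to return the matchings obtainable at an
    a-labelled root from the set `below` of matchings available at its
    children. Returns the stable sets and the number of productive rounds.
    """
    matched = {a: frozenset() for a in alphabet}
    rounds = 0
    while True:
        # A universal tree has a child with every label, so the root may
        # use matchings found at the roots of all smaller universal trees.
        below = frozenset().union(*matched.values())
        new = {a: matched[a] | frozenset(join(a, below)) for a in alphabet}
        if new == matched:
            return matched, rounds
        matched, rounds = new, rounds + 1

def toy_join(a, below):
    """Toy stand-in for joining matchings: numbers 0..3 play the matchings."""
    return {0} | {m + 1 for m in below if m < 3}
```

Since the sets only grow and are bounded by the finite set of all matchings, the loop terminates after finitely many rounds.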

B.4 Proof of the upper bound in Theorem 6

Let $V$ be an L-Datalog($\downarrow$) program and let $Q$ be a UCQ($\downarrow$). Our goal is to determine
whether for all databases $D$ we have $V(D) \subseteq Q(D)$. We solve the dual problem
and look for a counterexample for the containment, i.e., a database $D$ and a
node $X \in D$ such that $X \in V(D)$ but $X \notin Q(D)$. Moreover, we can assume that
$D$ is a canonical model. Let $n$ be the size of the biggest conjunct in $Q$. Since $Q$
is nonrecursive and connected, to determine if $X \in Q(D)$ it suffices to check the
subtree of $D$ containing nodes of distance at most $n$ from $X$.
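The locality argument says that only the nodes within distance $n$ of $X$ matter for the negative query. A minimal Python sketch of collecting that neighbourhood is given below; the encoding of a tree as a child-to-parent dictionary is an assumption for illustration.

```python
from collections import deque

def neighbourhood(parent, x, n):
    """Nodes at distance at most n from x in an unranked tree.

    `parent` maps every non-root node to its parent; distance counts
    edges and may go both up and down.
    """
    adjacent = {}
    for child, par in parent.items():
        adjacent.setdefault(child, set()).add(par)
        adjacent.setdefault(par, set()).add(child)
    distance = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if distance[v] == n:
            continue  # do not explore beyond radius n
        for w in adjacent.get(v, ()):
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    return set(distance)
```

On a path `0, 1, 2, 3, 4` the neighbourhood of node 2 with radius 1 is `{1, 2, 3}`.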

We shall refer to $V$ as the positive query and to $Q$ as the negative query. We
define an automaton $\mathcal{A} = (Q, A, \delta, q_0, F)$ that essentially recognizes satisfiable
proof words for $V$, simultaneously checking if the negative query is satisfied on
the canonical model of the read word. The alphabet $A = \{r_1, \ldots, r_m\}$ is the set
of rules of the program $V$.

We define the set of states $Q$ as a cartesian product of three components,
i.e., $Q = Q_1 \times Q_2 \times Q_3$. We describe each component separately. Recall that $\Sigma_V$
denotes the set of constants used explicitly in rules of program $V$. Let $N$ be the
size of the biggest rule in $V$. Let $B_1$ be an alphabet of $N$ different letters and
let $B_2$ be an alphabet of $2n + 1$ letters, disjoint from $B_1$.

In the first component $Q_1$ the automaton stores a labeled pattern corresponding
to the currently read letter (rule). Formally,

$$Q_1 = \bigcup_{r \in A} (\Sigma_V \cup B_1 \cup B_2)^{\pi_r}.$$

We identify the pattern $\pi_r$ with the set of its nodes, thus $Q_1$ is the set of patterns
whose nodes are labeled with elements of the set $\Sigma_V \cup B_1 \cup B_2$. The intended
meaning of $B_1$ and $B_2$ will be explained later.

In the second component $Q_2$ the automaton stores a word $w$ of length at
most $2n + 1$ and its position compared to the node $X$. Formally

$$Q_2 = \bigcup_{\substack{1 \le i \le 2n+1 \\ 0 \le k, l \le n}} (\Sigma_V \cup B_1 \cup B_2)^{[i]} \times (k, l),$$

where $[i] = \{1, \ldots, i\}$ and $(\Sigma_V \cup B_1 \cup B_2)^{[i]}$ is the set of words of length $i$ with
labels from $\Sigma_V \cup B_1 \cup B_2$. This word is a representation of an ancestor-path
starting from the intensional node $v_i$ of the current pattern stored in $Q_1$. This
is necessary to verify if the proof word is satisfiable. The ancestor path could be
arbitrarily long but, as we will see, we only need to remember nodes that are of
distance at most $n$ from $X$ (there are at most $2n + 1$ such nodes). Additionally
the automaton remembers how this path lies compared to $X$. For this it stores
a pair of numbers $(k, l)$ such that $0 \le k, l \le n$. Let $v_a$ be the least common
ancestor of $X$ and $v_i$. The number $k$ denotes the distance between $X$ and $v_a$,
and the number $l$ denotes the distance between $v_a$ and $v_i$. Note that $k + l$ is the
distance between $v_i$ and $X$. Also if $k = 0$ then $v_i$ is a descendant of $X$, and if
$l = 0$ then $X$ is a descendant of $v_i$.
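The pair $(k, l)$ can be computed from the least common ancestor as in the following Python sketch. The child-to-parent dictionary encoding and the function name are assumptions for illustration.

```python
def lca_offsets(parent, x, v):
    """Return (k, l): the distances from x and from v to their
    least common ancestor, so that k + l is the distance between x and v.

    `parent` maps every non-root node to its parent.
    """
    def ancestors(u):
        chain = [u]  # u itself, then its parent, grandparent, ...
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain

    steps_from_v = {node: i for i, node in enumerate(ancestors(v))}
    for k, node in enumerate(ancestors(x)):
        if node in steps_from_v:
            return k, steps_from_v[node]
    raise ValueError("nodes do not belong to the same tree")
```

In line with the remark above, `k == 0` means that `v` lies in the subtree of `x`, and `l == 0` means that `x` lies in the subtree of `v`.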

The last component $Q_3$ is the set of partial homomorphisms of the patterns
corresponding to CQs from the negative query $Q$. Let $w$ be the word stored
in the second component and let $A_Q$ be the set of all conjuncts $\varphi$ from $Q$.
Formally, $Q_3 = 2^{\bigcup_{\varphi \in A_Q} F_\varphi}$, where $F_\varphi$ is the set of all partial functions from $\pi_\varphi$
to $\Sigma_V \cup B_1 \cup B_2 \cup \{\flat\} \cup \{w_1, \ldots, w_{|w|}\}$. The interpretation of the labels will be
explained later.

We now define the transition relation $\delta$. Suppose that the automaton reads
a new letter $r$. Let $q = (q_1, q_2, q_3) \in Q_1 \times Q_2 \times Q_3$ be the previous state. We
show how the automaton calculates its new state $q' = (q'_1, q'_2, q'_3)$.

In the first component the automaton starts by checking whether the rule $r$ is
proper for the intensional predicate in the previous rule or, if it is the first letter,
whether it is a rule for the goal predicate. If neither of these cases
holds then the automaton immediately rejects the word. Otherwise it labels
$\pi_r$ in two phases. In the first phase it labels its head node $v_h$ with the same
label that the intensional node in $q_1$ has. Also the labels of the nodes that are
ancestors of $v_h$ must match the corresponding labels from the path in $q_2$. Then
the automaton labels nodes that have an explicit label from $\Sigma_V$. In the second
phase the automaton guesses the remaining labels from $\Sigma_V \cup B_1 \cup B_2$ respecting
the $\sim$ relation. If there is a node on which $\sim$ forces two different labels, then the
automaton rejects the word. This way we use a small alphabet to represent an
arbitrarily large set of labels. If in the state $q'$ we use a label that is also used in
the state $q$ but $\sim$ does not force them to be the same then we assume that in
the canonical model they are different labels.

In the second component the automaton first updates the pair $(k, l)$ so that
it agrees with the location of the new intensional node with respect to $X$. Then
it creates a new ancestor-path whose labels have to agree with the labels of the
old path in $q_2$, and the labels of those nodes in $q'_1$ that are ancestors of $v_i$. The
case when the distance of the new intensional node to $X$ is bigger than $n$ is
explained later.

In the last component the automaton starts by updating the old partial
functions. All labels that appeared in $q_1$ but were not used in the first phase
are replaced with $\flat$. The intended meaning is that these labels no longer appear
in the model. Actually this is where we use the crucial feature of the canonical
models. Since we use fresh labels whenever it is possible the automaton can
forget all labels that will no longer appear.

The automaton forgets all partial homomorphisms that have unmapped nodes
such that their label is forced by $\sim$ to be equal to a node labelled by $\flat$. This is
because such homomorphisms can never be fulfilled. Then the automaton extends
the remaining homomorphisms with new nodes from $q'_1$. The label $w_i$ denotes
the fact that the node was mapped to the corresponding node from the path in
$q'_2$. The next step is to relabel the partial homomorphisms so that they agree
with the new path. This way the automaton knows where it can extend the
homomorphisms. Note that if there is a partial homomorphism without any $w_i$
then it can be discarded because it cannot be extended. If at any time one of
the homomorphisms becomes a full homomorphism then the automaton rejects
the word.

So far we explained the behavior for the letters in the proof word that have
the intensional node of distance at most $n$ from $X$. This is of course not the
only possible case, but we already noticed that nodes of bigger distance have no
impact on the negative query. Because of this we can now use the results for the
satisfiability problem. Suppose that the automaton reads a letter $r$ such that its
intensional node is of distance bigger than $n$ from $X$. The automaton updates
the third component of its state in the usual way and rejects the word if a full
homomorphism is found. Let $v$ be the ancestor of the intensional node in $r$ such
that $v$ is of distance $n$ from $X$. The automaton assumes that there is a universal
tree $t_v$ over the alphabet $\Sigma_V \cup B_1 \cup B_2$ (see Definition 23) below $v$. It calculates
the set $matched(t_v)$ and finds all matchings that have the rule $r$ as the first rule
with the node $v$ in the root. The automaton chooses one of the matchings but
the last rule $r'$ can also have the intensional node below $v$. Then it proceeds
with $r'$ as it did with $r$. Eventually the automaton guesses a matching such that
the intensional node of the last rule $r''$ is of distance at most $n$ from $X$. Then it
stores $r''$ in $q'_1$ and updates the other states in the usual way. If instead of the
last rule there is OK then the automaton accepts the word.

Notice that the node $v$ may not exist. This happens when the least common
ancestor of $X$ and the intensional node of $r$ is of distance bigger than $n$ from
$X$. If such a situation occurs then, since we assumed that we work on canonical
models, all nodes from the next rules will be of distance bigger than $n$ from $X$.
Thus it suffices to check satisfiability starting from the rule $r$.

We slightly modified the canonical models using universal trees. For the positive
program we showed in Lemma 24 that we can use universal trees; and for the
negative program we ensured that the changes are on nodes that are of distance
bigger than $n$ from $X$.

The constructed automaton is non-empty iff there is a canonical model for
$V \wedge \neg Q$. We need to bound the size of the set of states. In the first component
every set of labelled rules $(B_1 \cup B_2 \cup \Sigma_V)^{\pi_r}$ is of exponential size in $|V|$ and the number
of rules is bounded by the size of $V$. The second component is a set of triples:
two numbers and a word of size at most $2n + 1$, which is exponential in the size of
$V$, $Q$. The third component is the powerset of all partial homomorphisms, which is
double exponential in the size of $V$ and $Q$. Thus the whole automaton is bounded
double exponentially. However, its states and transitions can be generated on the
fly in exponential space. To check its emptiness we make a simple reachability
test, which is in NLogSpace. We use the results about satisfiability to generate
all transitions, but by Proposition 22 this can be done in ExpTime. Altogether,
this gives an algorithm in ExpSpace.

B.5 Proof of the lower bound in Theorem 6

We consider the satisfiability problem of $V \wedge \neg Q$, where $V \in$ L-Datalog($\downarrow$) and
$Q \in$ UCQ($\downarrow$). To prove hardness, for a number $n$ and a Turing machine $M$ we
construct datalog programs $V$ and $Q$ of size polynomial in $|M|$ and $n$ such that
$V \wedge \neg Q$ is satisfiable iff $M$ accepts the empty word using not more than $2^n$ tape
cells. The program $V$ will encode the run of the machine, and the program $Q$
will ensure its correctness.

Assume that $B$ is the tape alphabet of $M$, $Q$ is the set of states, $F$ is the set
of accepting states and $\delta$ is the transition relation. The finite alphabet used by
the programs will contain the sets $B$ and $B \times Q$. The symbols from $B \times Q$ will be
used to mark the position of the head on the tape and the state of the machine.

We now define the rules of the positive program $V$. The program starts in a
node labeled with $\top$. We encode each configuration of $M$ (the current state and
the tape contents) by enforcing a full binary tree of height $n$. For this we need the
alphabet $\Sigma_V$ to contain the set $\bigcup_{0 < i \le n} \{(L, i), (R, i)\}$. The predicates $(L, i)$ and
$(R, i)$ denote the left and right son of the previous node, respectively. The tape
is encoded in the nodes below the leaves of the tree. The label of the node above
the root of the tree is used as an identifier of the encoded configuration. We
will refer to it as an identification node.

The goal rule is

$G_V(X) \leftarrow \top(X),\ X \downarrow Y,\ Init(Y),\ Y \downarrow Z,\ conf(Z).$

It means that the encoding of the initial configuration of the machine, whose
identification node is labeled with $Init$, is stored in the tree (note that $Init$
belongs to $\Sigma_V$). The program will then traverse the configuration trees one by
one in an infix order.

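$conf(X) \leftarrow X \downarrow Y,\ (L, 1)(Y),\ downleft^{1}(Y)$

$downleft^{i}(X) \leftarrow X \downarrow Y,\ (L, i+1)(Y),\ downleft^{i+1}(Y)$, for $i = 1, \ldots, n-1$

$downleft^{n}(X) \leftarrow X \downarrow Y,\ store(Y)$

$store(X) \leftarrow a(X),\ Y \downarrow X,\ (L, n)(Y),\ upleft^{n}(X)$

$store(X) \leftarrow a(X),\ Y \downarrow X,\ (R, n)(Y),\ upright^{n}(X)$, for every symbol $a \in B \cup (B \times Q)$

$upleft^{i}(X) \leftarrow Y \downarrow X,\ downright^{i-1}(Y)$, for $i = 1, \ldots, n$

$downright^{i}(X) \leftarrow X \downarrow Y,\ (R, i+1)(Y),\ downleft^{i+1}(Y)$, for $i = 0, \ldots, n-1$

$upright^{i}(X) \leftarrow Y \downarrow X,\ (R, i-1)(Y),\ upright^{i-1}(Y)$, for $i = 2, \ldots, n$

$upright^{i}(X) \leftarrow Y \downarrow X,\ (L, i-1)(Y),\ upleft^{i-1}(Y)$, for $i = 2, \ldots, n$

$upright^{1}(X) \leftarrow Y \downarrow X,\ next(Y).$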

Observe that when we reach $downleft^{n}$ we stop traversing the tree and the
program uses the rule $store$ to write the content of the tape. That is why there
is no rule $downright^{n}$.

The program finishes traversing the tree in $next$ and goes to the next
configuration of the machine. We ensure that the identification node of the next
configuration has the same label as the root of the tree which encodes the
previous one. This will enable the negative program to check the correctness of the
encoding.

$next(X) \leftarrow Y \downarrow X,\ Z \downarrow Y,\ Z \downarrow Y',\ Y' \sim X,\ Y' \downarrow X',\ conf(X').$

We finish when we find an accepting state. That is, for every letter $a \in B$ and
every $q \in F$ we have two non-recursive rules

$store(X) \leftarrow (a, q)(X),\ Y \downarrow X,\ (L, n)(Y)$

$store(X) \leftarrow (a, q)(X),\ Y \downarrow X,\ (R, n)(Y).$



Now let us define the rules of the negative program $Q$, which will be a
disjunction of queries describing possible errors in the encoding. The content of the
tape has to be defined uniquely. Hence, for each pair of different symbols $a$ and
$b$ from $B \cup (B \times Q)$ we have a rule

$G_Q(X) \leftarrow \top(X),\ X \downarrow Y,\ Y \downarrow X_0,\ X_0 \downarrow X_1,\ X_1 \downarrow X_2,\ \ldots,\ X_n \downarrow Z_1,\ X_n \downarrow Z_2,\ a(Z_1),\ b(Z_2).$


We cannot ensure that each configuration tree has its identification node
labeled differently, but we can guarantee that trees with the same labels of the
identification nodes encode the same configurations. For each pair of different
symbols $a$ and $b$ from $B \cup (B \times Q)$ we introduce a rule

$$\begin{aligned}
G_Q(X) \leftarrow{} & \top(X),\ X \downarrow Y,\ X \downarrow Y',\ Y \sim Y',\ Y \downarrow X_0,\ Y' \downarrow X'_0,\ X_0 \downarrow X_1,\ X'_0 \downarrow X'_1,\ X_1 \sim X'_1,\ \ldots, \\
& X_{n-1} \downarrow X_n,\ X'_{n-1} \downarrow X'_n,\ X_n \sim X'_n,\ X_n \downarrow Z,\ X'_n \downarrow Z',\ a(Z),\ b(Z').
\end{aligned}$$


We can also easily enforce that the configuration tree labeled with Init encodes 
the initial configuration of the machine with an empty word stored on the tape. 


Finally we have to make sure that the way the positive program $V$ moves
from one configuration to another is consistent with the transition function of the
machine. To do this we consider changes in the content of any three consecutive
tape cells, i.e., we take all tuples $(a_1, a_2, a_3, b_1, b_2, b_3)$ of symbols from $B \cup (B \times Q)$,
such that: if $a_1, a_2, a_3$ encode the content of three consecutive tape cells $i, i+1, i+2$,
respectively, then it is not possible for the machine to have $b_1, b_2, b_3$ on those
positions in the next configuration. For each of those tuples there is a set of
$2(n - 1)$ rules in $Q$. The rules are constructed depending on the least common
ancestor of the three leaves which encode the consecutive tape cells. We write
them down for $n = 3$. There are two rules that deal with the case when the least
common ancestor is the root of the tree


Gq(X) ±-T(X),XlY,YlX 0 ,XlY,X 0 ~Y, 

Xoix'Mixi x 0 ix?ix$ixg, 

(L, 1)(-X’ 1 ), (R, 2)(X 2 ), (L, 3)(X 3 ), 
(L,l)(X{),(i?,2)(X'),(i?,3)(X'), 
(i?,l)(A7),(L,2)(X"),(L,3)(X"), 

Yix 0 , x 0 |X4x 2 |x 3 , x 0 ix[ix' 2 ix' s , 

X V \r/ ~\r t y / / V^ 

1 ~ Ai,Aj ~A 1 ,...,A 3 ~ A 3 , 

XzlZx, ai(Zi), X' 3 ] r Z 2 1 a 2 (Z 2 ), X"-IZ 3 , a 3 (Z 3 ), 
A 3 4.Zi,6i(Zi), Xg4.Z 2 ,6 2 (Z 2 ), Ag |Z 3 , b 3 (Z 3 ) 

G q (X) <-T(X),XlY,YlXo,XiY,X 0 ~ F, 

a 0 ;a4a 2 |a 3 , a 0 ;a;;a'|a', x 0 |x"|a";a", 

(A, l)(-X' 1 ), (i?, 2)(A 2 ), (i?, 3)(A 3 ), 
(i?,l)(A0,(i,2)(A'),(L,3)(A'), 

(i?,l)(An,(A,2)(A"),(i?,3)(A"), 

n*o, A 0 |Ai|A 2 |A 3 , AoIAUA'IA', A 0 |A"Ut;A", 

X V y/ V^ y// y// 

l~Al,A 1 ~A 1 ,...,Ag ~ A g , 

A 3 4_Zi, ai(Zi), Xg4.Z 2 , a 2 (Z 2 ), X% iZ 3l a 3 (Z 3 ), 

Xa&MZi), A 3 I^2,6 2 (Z 2 ), X"|Z 3 ,6 3 (Z 3 ). 



And there are another two rules to deal with the case when the least common 
ancestor is labeled with (L, 1) or (R, 1) 


Gq(X) «-T (X), X\Y, Y{X 0 , XIY, X 0 ~ Y , 

XoiX lt XiiXaiXa, A^A'+A', A^A^A", 

(L, 2)(A 2 ), (L, 3)(A 3 ), 

(L,2)(A'),(f?,3)(A'), 

(i?,2)(X"),(L,3)(X"), 

Y\x < 4 *!, XrIXalXa, A^A'^, X^X'flX'f, 
X 1 ~X 1 ,X 2 ~X 2 ,...,X'£~X'£, 

Xa-lZi, ai(Zi), X' 3 ],Z2, 02 (^ 2 ), X3IZ3, 03 (^ 3 ), 
XsiZi,bi(Zi), X^Z- 2 , 62(^2), A’ 3 i-^ 3 ) ^3(^3) 

g q (a) <-t(a), A;y,nA 0 , a;f, a 0 ~ y, 

Xo|Ad,141 2 |X 3 , A^A'IA', XilXglXg , 

{L, 2)(A 2 ), (R, 3)(A 3 ), 

( j R,2)(A'),(L,3)(A'), 

(f?,2)(A"),(f?,3)(A"), 

nloUd, li|l 2 ;x 3 , A^A'IA', A^A^A", 

A-i ~ Xi , A 2 ~ A 2; ■ • • • A " ~ A", 

AsjAl, Cll(Zl), Agj-Zg, 02(^2), X'f 4 -^ 3 , 03(^3), 

Aslir, 6r (ir), A', & 2 (Z 2 ), A"|i 3 ,63(^3)- 

□ 
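
The count of $2(n-1)$ rules per tuple can be checked mechanically. The following Python sketch enumerates all triples of consecutive leaves of a full binary tree of depth $n$ and groups them by the depth of their least common ancestor and by the paths from that ancestor to the three leaves. It assumes, as in the rules written out for $n = 3$, that the path from the root to the least common ancestor is left unconstrained; the function names `leaf_path` and `rule_shapes` are illustrative and do not come from the construction itself.

```python
def leaf_path(i: int, n: int) -> str:
    """Root-to-leaf path (as a string over L/R) of the i-th leaf, counted
    from the left, in a full binary tree of depth n."""
    return "".join("R" if (i >> (n - 1 - k)) & 1 else "L" for k in range(n))


def rule_shapes(n: int) -> set:
    """Distinct (LCA depth, relative paths) shapes over all triples of
    consecutive leaves of a full binary tree of depth n."""
    shapes = set()
    for i in range(2 ** n - 2):
        paths = [leaf_path(i + k, n) for k in range(3)]
        d = 0
        # The three paths are pairwise distinct, so they disagree before index n.
        while paths[0][d] == paths[1][d] == paths[2][d]:
            d += 1
        shapes.add((d, tuple(p[d:] for p in paths)))
    return shapes


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, len(rule_shapes(n)))
```

For $n = 3$ the shapes with the least common ancestor at depth 0 are the label sequences LRL, LRR, RLL and LRR, RLL, RLR, which are the ones used by the first two rules above.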


B.6 Proof of Lemma [7] 

Take programs $\mathcal{P} \in \mathrm{Datalog}(\downarrow)$ and $\mathcal{Q} \in \mathrm{UCQ}(\downarrow)$. For every query $\varphi$ in $\mathcal{Q}$ consider
the pattern $\pi_\varphi$. Each of these patterns corresponds to a tree $t_\varphi$ which is unique
up to renaming of labels that are not explicitly mentioned by $\mathcal{Q}$. Additionally,
$t_\varphi$ has one marked node $X$ corresponding to the head node of $\varphi$. It remains to
check if $\mathcal{P}(X)$ holds for each of these trees. It is well known that the combined
complexity of monadic programs is NPTime-complete. For each $t_\varphi$ it suffices to
guess the proof tree and verify the correctness of the guess.

C Boundedness 

Proof (of Proposition^). The ’only if’ part is obvious. For the ’if’ part, suppose 
that a datalog program $\mathcal{P}$ is equivalent to a union of conjunctive queries $\mathcal{Q}$. For
every rule $\varphi$ of $\mathcal{Q}$ consider the pattern $\pi_\varphi$. With each of these patterns we associate
a set of trees: the possible homomorphic images of $\pi_\varphi$. Up to renaming of the
labels which are not explicitly mentioned by $\mathcal{Q}$ there are finitely many such trees
(this is because $\varphi$ is connected and does not use the relation $\downarrow^+$). We evaluate
the program $\mathcal{P}$ on each of these trees and take $n$ to be the biggest number of
applications of the rules in $\mathcal{P}$ that we need. Now let $t$ be any tree. We will show
that $\mathcal{P}(t) = \mathcal{P}^n(t)$. To this end, consider a node $X$ of $t$ such that $\mathcal{P}(X)$. Since
the programs $\mathcal{P}$ and $\mathcal{Q}$ are equivalent, $\mathcal{Q}(X)$ also holds. This means that for
some CQ $\varphi$ of $\mathcal{Q}$ there is a witnessing homomorphism $h$ from $\pi_\varphi$ to $t$. Thus, we
need at most $n$ applications of the rules in $\mathcal{P}$ to derive $\mathcal{P}(X)$, because $h(\pi_\varphi)$ is
a fragment of $t$. □
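
The notion of "number of applications of the rules" used in this proof can be illustrated with a small Python sketch. It is only an approximation of the formal definition: it counts rounds of naive bottom-up evaluation rather than individual rule applications, and the names `Tree`, `rounds_to_fixpoint`, `base`, `step` and `chain` are hypothetical helpers. The two rules model a recursive program that marks every node having a descendant-or-self labelled `a`.

```python
from typing import Callable, Dict, List, Set, Tuple

Fact = Tuple[str, int]  # (predicate name, node id)


class Tree:
    def __init__(self, labels: Dict[int, str], children: Dict[int, List[int]]):
        self.labels = labels
        self.children = children

    def nodes(self) -> List[int]:
        return list(self.labels)


Rule = Callable[[Tree, Set[Fact]], Set[Fact]]


def rounds_to_fixpoint(tree: Tree, rules: List[Rule]) -> Tuple[Set[Fact], int]:
    """Naive evaluation; returns the derived facts and the number of rounds
    in which at least one new fact was derived."""
    facts: Set[Fact] = set()
    rounds = 0
    while True:
        new: Set[Fact] = set()
        for rule in rules:
            new |= rule(tree, facts)
        if new <= facts:
            return facts, rounds
        facts |= new
        rounds += 1


def base(tree: Tree, facts: Set[Fact]) -> Set[Fact]:
    # P(X) <- a(X)
    return {("P", v) for v in tree.nodes() if tree.labels[v] == "a"}


def step(tree: Tree, facts: Set[Fact]) -> Set[Fact]:
    # P(X) <- X has a child Y with P(Y)
    return {("P", v) for v in tree.nodes()
            for c in tree.children.get(v, []) if ("P", c) in facts}


def chain(k: int) -> Tree:
    """A path of k nodes whose last node is labelled 'a'."""
    labels = {i: "b" for i in range(k)}
    labels[k - 1] = "a"
    return Tree(labels, {i: [i + 1] for i in range(k - 1)})
```

On `chain(k)` the recursive program needs `k` rounds, so no single bound works for all trees, whereas the nonrecursive program consisting of `base` alone always stabilises after one round.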

C.1 Undecidability of the boundedness problem in general

Proof (of Theorem [11]). We will reduce from the following problem: given a Turing
machine M, are there arbitrarily long runs of M that start from an empty tape and
end in the halting state (denoted HALT)? This problem is undecidable by a
reduction from the halting problem: given a machine M, for every transition of M
that goes from a state q seeing symbol a on the tape to the HALT state, we add
another transition that stays in the state q after reading a and does not change
the position of M's head. Thus, if M has a run that halts, the modified machine
has arbitrarily long halting runs; conversely, if M has no halting run, neither
does the modified machine.
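
The modification of the machine can be sketched in a few lines of Python. The representation of the transition relation as a dictionary, the name `add_stalling_transitions`, and the choice that the added transition rewrites the symbol it has just read are assumptions made for illustration only.

```python
from typing import Dict, Set, Tuple

# (next state, written symbol, head move in {-1, 0, +1})
Transition = Tuple[str, str, int]
Delta = Dict[Tuple[str, str], Set[Transition]]


def add_stalling_transitions(delta: Delta, halt: str = "HALT") -> Delta:
    """Return a copy of delta in which every (state, symbol) pair that can
    move to the halting state may also stay where it is."""
    modified = {key: set(targets) for key, targets in delta.items()}
    for (state, symbol), targets in delta.items():
        if any(next_state == halt for next_state, _, _ in targets):
            # Assumed form of the extra transition: same state, same symbol,
            # head does not move, so the configuration is unchanged.
            modified[(state, symbol)].add((state, symbol, 0))
    return modified
```

A halting run of the original machine can then be stretched by repeating the added transition any number of times just before the final step.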

Let M be a Turing machine. We can assume without loss of generality that
M has one tape, semi-infinite to the right. We will construct two programs, P
and Q. Program P will find the encoding of the run of M on an empty input
in the tree and Q will detect errors in the encoding. The program Q will be
equivalent to a UCQ. Moreover, we will ensure that for every correct
run of M, there is only one corresponding encoding. Our program $P_M$ will be
an alternative of P and Q:

$P_M(X) \mathrel{:-} P(X)$

$P_M(X) \mathrel{:-} Q(X)$

If a tree contains an error in the encoding, $P_M$ will hold for every node of the
tree in just 3 steps of the computation, because Q will be equivalent to a UCQ.
The constructed program will be unbounded if and only if M has arbitrarily long
halting runs.

The run of M will be encoded as a word describing consecutive configurations.
Configurations will be separated by # symbols. The beginning of the encoding
will be a START symbol and the end will be denoted by END. Each position on
the tape will be encoded by 4 consecutive nodes, R-N-C-T, where R will
denote the row number, N the number of the next row, C the column number and
T the encoded tape symbol. Tape symbols will be marked with 0 or 1 denoting
whether the head of M is in this position. Because we consider trees, the
encoding will be placed in the tree from some node upwards to the root. This
way, the program will have only one path on which it can match. Otherwise
(that is, going downwards in the tree) the correctness of the encoding cannot
be guaranteed.
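
To make the shape of the encoding concrete, here is a minimal Python sketch that turns a list of configurations into the sequence of labels read from START up to END. Several details are assumptions rather than part of the construction: that every row is followed by a # symbol, that row and column numbers can be represented by the tuples `("row", i)` and `("col", j)`, and the function name `encode_run` itself.

```python
from typing import List, Tuple, Union

Symbol = Tuple[int, str]                    # (head marker 0/1, tape symbol)
Label = Union[str, Tuple[str, int], Symbol]


def encode_run(configs: List[List[Symbol]]) -> List[Label]:
    """Encode consecutive configurations as a word of labels.

    Each tape position contributes four labels in the order R, N, C, T:
    row number, number of the next row, column number, tape symbol."""
    word: List[Label] = ["START"]
    for i, config in enumerate(configs):
        for j, (head, sym) in enumerate(config):
            word += [("row", i), ("row", i + 1), ("col", j), (head, sym)]
        word.append("#")                    # assumed: each row ends with '#'
    word.append("END")
    return word
```

With an infinite set of labels, the row and column numbers only need to be pairwise distinguishable by the label comparison $\sim$, which is why arbitrary distinct tokens suffice in this sketch.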

For each transition $\tau$ of M, there will be a set of rules verifying that two
consecutive encoded configurations of M are consistent with $\tau$. Single rules will
verify that the contents of the tape are copied or changed correctly between the
configurations. To ensure that, the rules will look at each 3 consecutive positions.
For each triple of tape symbols, there will be a rule that matches 3 positions
encoding those tape symbols. A rule $P^i_{\tau,a_1,a_2,a_3}(X)$ is true in X if the 3 positions
described directly above X contain symbols $a_1, a_2, a_3$ and the symbol in the next
configuration in the same position as $a_2$ is also consistent with $\tau$. If the head of
M is in none of these positions of the tape, this symbol should just be copied, but
if the head of M is in the position with $a_1$, $a_2$ or $a_3$ the symbol can change between
configurations. The index $i$ is 1 if the head of the tape was already seen in this
configuration, and 0 otherwise. For example, for a position where the head has not
been seen in this configuration and there is no head in the inspected positions:

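\[
\begin{aligned}
P^0_{\tau,(0,s_1),(0,s_2),(0,s_3)}(R_1) \mathrel{:-}{} & T_3{\downarrow}C_3{\downarrow}N_3{\downarrow}R_3{\downarrow}T_2{\downarrow}C_2{\downarrow}N_2{\downarrow}R_2{\downarrow}T_1{\downarrow}C_1{\downarrow}N_1{\downarrow}R_1, \\
& T_5{\downarrow}C_5{\downarrow}N_5{\downarrow}R_5{\downarrow}T_4{\downarrow}C_4{\downarrow}N_4{\downarrow}R_4{\downarrow^+}T_3, \\
& (0,s_1)(T_1),\ (0,s_2)(T_2),\ (0,s_3)(T_3),\ (0,s_2)(T_5), \\
& N_1 \sim R_5,\ R_4 \sim N_1,\ C_5 \sim C_2,\ C_4 \sim C_1, \\
& N_1 \sim N_2 \sim N_3,\ R_1 \sim R_2 \sim R_3,\ R_4 \sim R_5, \\
& P^0_{\tau,(0,s_2),(0,s_3),(i,s_4)}(R_2)
\end{aligned}
\]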

There will be such a rule for every possible tape symbol $(i, s_4)$. A quadruple $R_i, N_i, C_i, T_i$
of variables describes one position of the tape, in the configuration $R_i$, with next
configuration $N_i$ and in column $C_i$. The symbol stored in this position is $T_i$.
Additionally, there will be rules for changing rows, which check the two last positions
before the # and ensure that the next row is either the same length as the previous
one or one position longer (that is, has 4 more nodes), depending on the
movement of the head. There will also be rules for the final row of the encoding
(that is, after reaching the halting state), $P_{fin}$. $P_{fin}$ will just go to the last #, and
P will be true in the root of the tree (with END label) if $P_{fin}$ is matched in the
last #.

The program Q is given below, where $Q_{err}$ is an alternative of all possible
errors in the encoding.


$Q(X) \mathrel{:-} Y \downarrow^+ X,\ Y \downarrow^+ Z,\ Q_{err}(Z)$    (1)

$Q(X) \mathrel{:-} Q_{err}(X)$    (2)

$Q(X) \mathrel{:-} X \downarrow^+ Y,\ Q_{err}(Y)$    (3)

Note the necessity of this triple alternative, as $\downarrow^+$ is a proper descendant relation,
that is, $X \downarrow^+ X$ does not hold. This way, Q holds in every node of the tree if $Q_{err}$
is found anywhere. The possible errors are:

1. # or a tape symbol appearing in the wrong position, for example detecting the
symbol (0, s) used as a column number:

$Q_{err}(X) \mathrel{:-} \#(X),\ X_3{\downarrow}X_2{\downarrow}X_1{\downarrow}X,\ (0,s)(X_3)$

$Q_{err}(X) \mathrel{:-} \#(X),\ X_3{\downarrow}X_2{\downarrow}X_1{\downarrow^+}Y{\downarrow}X,\ Y \sim X_1,\ (0,s)(X_3)$

Similar rules can be constructed for the next row number, the row number, and
for # used as a tape symbol.

2. two consecutive # symbols, detected by $Q_{err}(X) \mathrel{:-} \#(X),\ X{\downarrow}Y,\ \#(Y)$.

3. any node appears above the END, detected by $Q_{err}(X) \mathrel{:-} \mathrm{END}(X),\ Y{\downarrow}X$.

4. any node appears below the START, detected by $Q_{err}(X) \mathrel{:-} \mathrm{START}(X),\ X{\downarrow}Y$.

5. a row number used in two different rows, detected by

$Q_{err}(X) \mathrel{:-} \#(X),\ Y_1{\downarrow}X,\ Y_2{\downarrow^+}Y_1,\ \#(Y_2),\ Z{\downarrow}Y_2,\ Z \sim Y_1$

6. the same column number twice in one row, detected by

$Q_{err}(X) \mathrel{:-} \#(X),\ Z_3{\downarrow}Z_2{\downarrow}Z_1{\downarrow^+}Y_3{\downarrow}Y_2{\downarrow}Y_1{\downarrow}X,\ Y_1 \sim Z_1,\ Y_3 \sim Z_3$

The last rule works only if every row has a distinct row number, which is
ensured by the previous rule.
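
The error conditions 2, 5 and 6 can be mirrored procedurally on a word of labels, which may help when checking the conjunctive rules by hand. The Python sketch below is not the program Q itself: it assumes the word is given as a list of labels read from START to END, with every row consisting of R, N, C, T quadruples followed by "#", and the names `split_rows` and `encoding_errors` are hypothetical.

```python
from typing import Hashable, List, Set


def split_rows(word: List[Hashable]) -> List[List[Hashable]]:
    """Split the word into rows at '#' symbols, dropping START and END."""
    rows, current = [], []
    for label in word:
        if label == "#":
            rows.append(current)
            current = []
        elif label not in ("START", "END"):
            current.append(label)
    return rows


def encoding_errors(word: List[Hashable]) -> Set[str]:
    """Report which of the errors 2, 5 and 6 occur in the word."""
    errors: Set[str] = set()
    if any(a == "#" and b == "#" for a, b in zip(word, word[1:])):
        errors.add("consecutive #")
    seen_rows: Set[Hashable] = set()
    for row in split_rows(word):
        quads = [row[k:k + 4] for k in range(0, len(row) - 3, 4)]
        cols = [q[2] for q in quads]
        if len(cols) != len(set(cols)):
            errors.add("repeated column number")
        row_ids = {q[0] for q in quads}
        if row_ids & seen_rows:
            errors.add("row number reused")
        seen_rows |= row_ids
    return errors
```

As in the rules above, the column check is only meaningful once row numbers are known to be distinct, since a row is recognised by its row label.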

It is easy to see that $P_M$ is matched in every node of any tree that contains
one of the described errors, and in the root node of those databases that contain
a correct encoding of a halting run of M. Moreover, the computation of $P_M$ in
those databases takes a number of steps linearly proportional to the length of the
encoding. Therefore, $P_M$ is unbounded if and only if M has arbitrarily long
halting runs. □

C.2 Boundedness on words and ranked trees

Proof (of Lemma [14]). One implication is immediate. If $\mathcal{P}$ is bounded then it is
equivalent to a union of conjunctive queries $\mathcal{Q}$. The queries are connected so we
can take $n$ to be the size of the biggest query in $\mathcal{Q}$.

For the other implication, let us assume that $\mathcal{P}$ satisfies the condition:

- there exists $n > 0$ such that for every word $w$ and position $X$, if $X \in \mathcal{P}(w)$
then $X \in \mathcal{P}(v)$, where $v$ is the $n$-neighbourhood of $X$ in $w$

with $n = n_0$. We will construct a union of conjunctive queries $\mathcal{Q}$ equivalent to
$\mathcal{P}$. Recall that $\Sigma_{\mathcal{P}}$ denotes the set of labels that appear in the rules of program
$\mathcal{P}$. Let us consider all words of length smaller or equal $2n_0 + 1$ and treat them
as structures over the signature $\{\downarrow, \sim\} \cup \Sigma_{\mathcal{P}}$. These words have finitely many
equality types. For each word $v$ that satisfies $\mathcal{P}$ we add to $\mathcal{Q}$ a query which
defines the equality type of $v$. It remains to show that $\mathcal{P}$ and $\mathcal{Q}$ are equivalent.
The containment of $\mathcal{Q}$ in $\mathcal{P}$ is straightforward from the construction of $\mathcal{Q}$. For
the converse, take a word $w$ and position $X$ such that $X \in \mathcal{P}(w)$. Then $X \in \mathcal{P}(v)$,
where $v$ is the $n_0$-neighbourhood of $X$ in $w$. Since $v$ is a word of length at most
$2n_0 + 1$ it follows that $X \in \mathcal{Q}(v)$, and hence $X \in \mathcal{Q}(w)$. □
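
The locality condition used in this proof is easy to experiment with. The Python sketch below computes the $n$-neighbourhood of a position in a word and searches small words for violations of the condition. The two sample queries `near_a` and `reach_a`, as well as the helper names, are illustrative assumptions: they stand in for the result of evaluating a program at a position and are not datalog programs themselves.

```python
from itertools import product
from typing import Callable, List, Tuple


def neighbourhood(w: str, x: int, n: int) -> Tuple[str, int]:
    """Return the n-neighbourhood of position x in w, together with the
    position of x inside it. Its length is at most 2n + 1."""
    lo = max(0, x - n)
    hi = min(len(w), x + n + 1)
    return w[lo:hi], x - lo


def counterexamples(holds: Callable[[str, int], bool], n: int,
                    max_len: int, alphabet: str = "ab") -> List[Tuple[str, int]]:
    """All (w, x) with |w| <= max_len where the query holds at x in w
    but fails at x in the n-neighbourhood of x."""
    found = []
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            w = "".join(letters)
            for x in range(length):
                v, y = neighbourhood(w, x, n)
                if holds(w, x) and not holds(v, y):
                    found.append((w, x))
    return found


def near_a(w: str, x: int) -> bool:
    """Some position at distance 1 or 2 to the right of x carries 'a'."""
    return "a" in w[x + 1:x + 3]


def reach_a(w: str, x: int) -> bool:
    """Some position strictly to the right of x carries 'a'."""
    return "a" in w[x + 1:]
```

Here `near_a` satisfies the condition with $n = 2$, while `reach_a` violates it for every fixed $n$ on a sufficiently long word, in line with the intuition that it needs unboundedly many recursive steps.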

We now move to the case of trees. First let us state the lemma equivalent to
Lemma [14] for ranked trees. For a tree $t$, the $n$-neighbourhood of a node $X$ is the
subtree of $t$ consisting of all nodes that are at distance at most $n$ from $X$.
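
The definition of the $n$-neighbourhood in a tree can be read as a breadth-first search in which edges are followed both towards the children and towards the parent. A minimal Python sketch, assuming the tree is given as a hypothetical `parent` map from each node to its parent (with `None` for the root):

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Set


def tree_neighbourhood(parent: Dict[Hashable, Optional[Hashable]],
                       x: Hashable, n: int) -> Set[Hashable]:
    """Nodes at distance at most n from x, where distance counts edges
    regardless of their direction."""
    children: Dict[Hashable, List[Hashable]] = {}
    for child, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(child)

    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if dist[u] == n:
            continue  # do not expand beyond distance n
        neighbours = list(children.get(u, []))
        if parent.get(u) is not None:
            neighbours.append(parent[u])
        for v in neighbours:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)
```

The returned node set induces a connected subtree, because every node on the path between $X$ and a node at distance at most $n$ is itself at distance at most $n$ from $X$.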


Lemma 26. Let $\mathcal{P}$ be a Datalog($\downarrow$) program over ranked trees. Then the following
conditions are equivalent:

1. $\mathcal{P}$ is bounded,

2. there exists $n > 0$ such that for every tree $t$ and node $X$, if $X \in \mathcal{P}(t)$ then
$X \in \mathcal{P}(t')$, where $t'$ is the $n$-neighbourhood of $X$ in $t$.

Proof. The proof is analogous to the proof of Lemma [14]. Let $k$ be the rank of
the considered trees. To show the implication from [2] to [1] it is enough to notice
that for a given $n$ there are finitely many equality types (with respect to $\mathcal{P}$) of
trees of height at most $2n + 1$ (and thus finitely many equality types of
$n$-neighbourhoods). The equality type of each such $n$-neighbourhood is definable
by a CQ, and a UCQ equivalent to $\mathcal{P}$ is a union of those CQs that are contained
in $\mathcal{P}$. □

In the case of trees we define an $n$-witness for $\mathcal{P}$ to be a tree $t$ such that
there exists a node $X$ in $t$ for which $X \in \mathcal{P}(t)$ but $X \notin \mathcal{P}(t')$, where $t'$ is the
$n$-neighbourhood of $X$ in $t$. A witness is a tree that is an $n$-witness for some $n > 0$.

Corollary 27. A Datalog($\downarrow$) program $\mathcal{P}$ over ranked trees is unbounded iff there
exist $n$-witnesses for arbitrarily big $n > 0$.

We can now give the proof of Theorem [13]. We restate it first.

Theorem. The boundedness problem for Datalog($\downarrow$) over ranked trees is in
2-ExpTime.

Proof. To prove Theorem [13] we first show that boundedness can be verified over
ranked trees over a finite alphabet.

Lemma 28. Let $\mathcal{P}$ be a Datalog($\downarrow$) program. Then $\mathcal{P}$ is bounded over ranked
data trees with rank $R$ over $\Sigma$ iff $\mathcal{P}$ is bounded over ranked trees with the same
rank over a finite alphabet $\Sigma_0 \subseteq \Sigma$. The alphabet $\Sigma_0$ contains $\Sigma_{\mathcal{P}}$ and
$|\Sigma_0 \setminus \Sigma_{\mathcal{P}}| \le R^{|\mathcal{P}|}$.

Proof. This proof is a slight modification of a proof from HZI. If $\mathcal{P}$ is bounded
over $\Sigma$ then it is clearly bounded over any finite subset of $\Sigma$. Suppose that $\mathcal{P}$
is bounded over $\Sigma_0$ but not bounded over $\Sigma$. Over $\Sigma_0$, $\mathcal{P}$ is therefore equivalent
to a UCQ $\mathcal{Q}$ built of a finite number of proof words of $\mathcal{P}$. Let $t$ be a tree over
$\Sigma$ and $X$ a node in $t$ such that $X \in \mathcal{P}(t)$ but $X \notin \mathcal{Q}(t)$. We will show that $t$ can be
relabeled into a tree $t'$ over $\Sigma_0$ in a way preserving any label comparison done
by the rules of $\mathcal{P}$. Then, as $\mathcal{Q}$ is a union of proof words of $\mathcal{P}$, it must also hold
that $X \in \mathcal{Q}(t)$ iff $X \in \mathcal{Q}(t')$, which is a contradiction since $\mathcal{P}$ is not equivalent
to $\mathcal{Q}$ over ranked trees over $\Sigma$.

Let $n$ be the size of the largest rule in $\mathcal{P}$. Let $B \subseteq \Sigma \setminus \Sigma_{\mathcal{P}}$ be a set of size
$R^{|\mathcal{P}|}$. We set $\Sigma_0 = B \cup \Sigma_{\mathcal{P}}$. We will describe a procedure that traverses the tree
$t$ in a top-down fashion, level by level, and changes the labels to elements of $B$.
This way the set of processed nodes consists of $i$ full levels starting from the
root, and some nodes from the level $i + 1$.

Let $v$ be a node on level $i + 1$, the next one to process, and let $u$ be the
node $n - 1$ edges up the tree (or the root if $v$ is too close to the root). Suppose
that the label of $v$ is $a$. If $a \in B \cup \Sigma_{\mathcal{P}}$, we can finish processing $v$. Assume
that $a \notin B \cup \Sigma_{\mathcal{P}}$. Pick a label $b \in B$ that does not appear in the processed
descendants of $u$, nor in $u$ itself. We can always find such a label $b$ because
the number of processed descendants of $u$ (including $u$ itself) is bounded by

\[
\sum_{l=0}^{n-1} R^l = \frac{R^n - 1}{R - 1} < R^n \le R^{|\mathcal{P}|},
\]

and so is the number of labels from $B$ used
in these nodes. Let $c \in \Sigma \setminus (B \cup \Sigma_{\mathcal{P}})$ be a fresh label. We now replace all
appearances of $b$ with $c$, but only in the unprocessed descendants of the node $u$.
Observe that these nodes are separated from the nodes that keep their label $b$
by distance at least $n$. Next, we replace all appearances of $a$ with $b$, but only in
the unprocessed descendants of $u$. Again, the distance from these nodes to the
other nodes with label $a$ or $b$ is at least $n$. Thus, the modification does not affect
the outcome of any label comparison done by rules in $\mathcal{P}$ (because they use only
the short axis and are connected). After all nodes are processed, all labels in $t'$
are from $B \cup \Sigma_{\mathcal{P}}$.

Let $\Sigma_0$ be the finite alphabet from the previous Lemma. Now we can construct
an automaton $W_{\mathcal{P}}$, recognizing the set of witnesses for $\mathcal{P}$. From Lemma [Til we
get a two-way alternating tree automaton $B_{\mathcal{P}}$ which works over $\Sigma_0 \times \{0,1\}$, and
accepts the set of trees that have only one node labeled with $(a, 1)$ for $a \in \Sigma_0$,
and the goal predicate of $\mathcal{P}$ is satisfied in this node. The size of this automaton
is exponential in $|\mathcal{P}|$. Let $A_{\mathcal{P}}$ be the bottom-up automaton recognizing $L(B_{\mathcal{P}})$
obtained via Proposition [18]. Let $A^{\neg}_{\mathcal{P}}$ be an automaton obtained by taking a
product of the bottom-up automaton recognizing the complement of $L(B_{\mathcal{P}})$ (again
obtained via Proposition [18]) and the automaton checking that there is only one
node in the tree with label $(a, 1)$ for some $a \in \Sigma_0$. Then $A^{\neg}_{\mathcal{P}}$ accepts all trees
over $\Sigma_0$ for which $\mathcal{P}$ does not hold in the marked node. The size of both $A_{\mathcal{P}}$ and
$A^{\neg}_{\mathcal{P}}$ is double exponential in $|\mathcal{P}|$.

With those two automata, the construction of $W_{\mathcal{P}}$ is easy. The set of states
of $W_{\mathcal{P}}$ is

\[
Q(A_{\mathcal{P}}) \times (\{e, \mathrm{OK}\} \cup Q(A^{\neg}_{\mathcal{P}}))
\]

where $Q(A)$ denotes the set of states of the automaton $A$. Let $t$ be a tree over
$\Sigma_0 \times \{0,1\}$ and let $X$ denote the marked node. The automaton $W_{\mathcal{P}}$ starts in
the state $(q_I, e)$, where $q_I$ is the initial state of $A_{\mathcal{P}}$. Then $W_{\mathcal{P}}$ simulates $A_{\mathcal{P}}$
on $t$. In any node of a tree, the automaton $W_{\mathcal{P}}$ can guess that here begins the
neighbourhood of $X$ in which $\mathcal{P}$ does not hold. Then $W_{\mathcal{P}}$ changes the second
component of its state from $e$ to the initial state of $A^{\neg}_{\mathcal{P}}$ and simulates $A^{\neg}_{\mathcal{P}}$ on the
guessed neighbourhood, verifying that indeed $\mathcal{P}$ does not hold in it. If $W_{\mathcal{P}}$ has
reached an accepting state of $A^{\neg}_{\mathcal{P}}$, it can guess that this node is the root of the
neighbourhood and change the state to OK in the second component. Accepting
states of $W_{\mathcal{P}}$ are states $(q, \mathrm{OK})$ where $q$ is any accepting state of $A_{\mathcal{P}}$.

Similarly to the word case, if there exists a witness of size linear in the size
of the automaton $W_{\mathcal{P}}$, then there exist arbitrarily big witnesses.

Lemma 29. Let $N$ be the number of states of the automaton $W_{\mathcal{P}}$. If there exists
a $(2N + 2)$-witness for $\mathcal{P}$, then there exist $n$-witnesses for arbitrarily large $n$. The
existence of a $(2N + 2)$-witness can be decided in time polynomial in $N$.

Proof. We use a pumping argument very similar to the one in the word case. This time,
however, to obtain arbitrarily big witnesses we need to be able to pump every
path of the neighbourhood in which $\mathcal{P}$ is not satisfied.

Suppose that there exists a $(2N + 2)$-witness and let $X$ be the marked node.
Then on every path of length $2N + 2$ from $X$ downwards, some state of $W_{\mathcal{P}}$ must
repeat, so we can pump the context between those nodes. Notice that some paths
may be shorter, because the $(2N + 2)$-witness may contain a leaf of the tree;
we do not need to pump those paths. On the path from $X$ upwards of length
$N + 1$ again some state of $W_{\mathcal{P}}$ repeats, and we can pump the context between
the occurrences of the same state. This time, however, we also need to extend
the paths that start on the pumped fragment and go downwards, but do not
return to $X$. Every such path is of length at least $N + 1$ (that is why we need
the $2N + 2$ size of the neighbourhood), so we can pump each of them (except
for those that are shorter because they end with a leaf of the tree).

To verify the existence of a $(2N + 2)$-witness we modify the automaton $W_{\mathcal{P}}$
by adding two counters ranging from 0 to $2N + 2$. When the automaton guesses the
beginning of a neighbourhood of $X$ in a non-leaf node $Y$ it starts counting the
length of the shortest path until the least common ancestor of $Y$ and $X$ is
reached. The automaton in a node calculates the length of the shortest path as
1 + the minimum of the values of the counters calculated for its children (if the
value of the counter is $2N + 2$, adding 1 does not change its value). When a
neighbourhood of $X$ begins in a leaf of the tree, the length of this path does
not need to be $2N + 2$, so the automaton sets the counter to $2N + 2$ (that is,
a sufficient length). The second counter is used only for the nodes on the path
above $X$ and counts the length of the path from $X$ to this node (for any other
node in the guessed neighbourhood, the value of this counter is 0).

It is not difficult to see that using those two counters we can come up with
an acceptance condition such that the modified automaton has an accepting run
iff there exists a $(2N + 2)$-witness for $\mathcal{P}$. Since emptiness can be decided in time
linear in the size of the automaton, we get the claim. □


Since the size of $W_{\mathcal{P}}$ is double exponential in $|\mathcal{P}|$, we get a 2-ExpTime procedure
for deciding boundedness of $\mathcal{P}$. □



